timbitz / Whippet.jl

Lightweight and Fast; RNA-seq quantification at the event-level
MIT License
105 stars 21 forks source link

what is forum for code questions? #121

Closed tsgouros closed 3 years ago

tsgouros commented 3 years ago

Is there a forum or mailing list somewhere for asking questions like what is the purpose of the fill!() function in process_reads() and what is the sequence it sometimes replaces in the reads object (FASTQRecord)?

(With apologies for cluttering up your issues list) Thank you.

timbitz commented 3 years ago

Hey @tsgouros, you can search for functions you don't know in the repository with github, eg. searching for fill! leads to https://github.com/timbitz/Whippet.jl/blob/master/src/record.jl which houses that function, and the struct it's filling. You can also check out the repositories of dependencies-- sometimes I override their functions for some specific functionality.

tsgouros commented 3 years ago

Thanks. I forked the repo and have been examining the code. It's easy to find the functions, and I found that one and stared at it for a while and ran it testing inputs and outputs. I just don't understand why it's doing what it's doing. The struct doesn't seem totally empty before the fill!() call, sometimes has "EMPTY SEQUENCE", sometimes has a sequence that is kept, sometimes a sequence that is replaced. I'm new to both Julia and genomics so very little is clear to me even if the code seems pretty clean (which it does, btw), was just hoping there was a forum somewhere to guide me on this kind of question.

What I'm trying to do is to measure the heterogeneity of reads that are classified as belonging to a gene, and hoping I can find a place to write out the read identifier when it is found to match a specific gene. Being new, I'm not really sure I can do this by hacking Whippet, but since the analysis the lab has already done has been with Whippet it would be convenient to be able to use the same code base.

timbitz commented 3 years ago

Hey, so yeah, in this case that struct is there to emulate an earlier version of the biosequences library that used FASTQRecord as a struct and had a fill function to parse read bytes into a biosequence, which I could specify as a two bit encoding then. So thats the purpose there— now the parser lives in FASTX.jl and has different api, which is why it might seem counter intuitive to have such a wrapper struct from a design perspective— there is some history there, but it was just more convenient this way. Just use the read after the call to fill! lol.

Anyways you can definitely do what you’re trying to I think. The alignment object returned for a read has a path of nodes, each node has a .node and a .gene entry (though there are multiple paths for multimappjng reads, so you'll need to handle that). You can find the mapped gene’s index for a path with something like ‘geneind = first(path).gene’ and then look up the gene name in the ‘lib’ variable — that is the whippet index. ‘lib.names[geneind]’ should have the gene id. You can print out the read identifier using api from FASTX.jl, and the ‘raw’ record object in the fastq struct.

That’s the gist of it, I’m on vacation, so this is just on my phone, but that should be enough to get what you’re saying to work I think. Good luck!

tsgouros commented 3 years ago

This worked great. Thank you.