Allow importing of gene names with underscores

JosephLalli commented 2 years ago

While trying to use this excellent tool on a project of mine that has a customized transcriptome, I encountered an error when using the a1/a2 suffixes, 'L' and 'R' (the default suffixes for g2gtools).

Some of my gene names are of the format "XXX_LYYY" or "XXX_RYY". Removing the suffixes from the gene results in everything after the first _L or _R being deleted, which caused errors with importing salmon results.

I've made two small changes to the import code to ensure that only text after the last '_' is removed. This solves the issue on my end.

JosephLalli commented 2 years ago

(Realized while writing the above that my code removed all text after the last '_' by default. This code actually ensures that text after the last '_' is one of the two allele suffixes.)

mikelove commented 2 years ago

Can you describe the problem again? You have gene names with _L and _R in the identifier and you are also using g2gtools, which attaches an additional _L and _R? So do you have transcripts with _L..._L... etc.?

I want to explore this a bit more as it's a key step.

JosephLalli commented 2 years ago

Here's a good minimal working example of a tx2gene tsv that will cause the error:

CTX.PB.8991.7_L	WFIKKN1_RAB40C_L
CTX.PB.8991.7_R	WFIKKN1_RAB40C_R
CTX.PB.8009.1_L	novelGene_LINC02291_AS_L
CTX.PB.8009.1_R	novelGene_LINC02291_AS_R

After removal of the allele suffix, these gene names become: WFIKKN1, WFIKKN1, novelGene, novelGene
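To make the failure mode concrete, here is a minimal Python sketch of it. fishpond itself is written in R, and this is not the package's code: `strip_at_first_suffix` is a hypothetical stand-in that simply cuts the name at the first "_L" or "_R", which is the behavior described above.

```python
import re

def strip_at_first_suffix(gene, suffixes=("L", "R")):
    """Hypothetical stand-in for the buggy behavior: truncate the gene
    name at the FIRST occurrence of "_L" or "_R", wherever it appears."""
    pattern = "_(?:" + "|".join(re.escape(s) for s in suffixes) + ")"
    match = re.search(pattern, gene)
    return gene[:match.start()] if match else gene

# Gene column of the tx2gene example above
genes = [
    "WFIKKN1_RAB40C_L",
    "WFIKKN1_RAB40C_R",
    "novelGene_LINC02291_AS_L",
    "novelGene_LINC02291_AS_R",
]

truncated = [strip_at_first_suffix(g) for g in genes]
print(truncated)
```

The first "_R" in "WFIKKN1_RAB40C_L" belongs to "RAB40C", and the first "_L" in "novelGene_LINC02291_AS_L" belongs to "LINC02291", so both names are cut short.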

Some context: I am working with a modified NCBI110 reference gtf that also includes novel gene isoforms discovered via Iso-Seq. The novel gene isoforms were lifted over to the T2T genome with Liftoff. I've encountered problems with two kinds of gene names:

In the first case (CTX.PB.8991.7_L), the novel isoforms have been given some odd gene names: where an isoform overlaps two genes, Liftoff has automatically assigned the transcript to a gene named "Gene1_Gene2". When Gene2 begins with L or R, these names improperly become "Gene1" upon removal of allele suffixes.

In the second case (CTX.PB.8009.1_L), at some point in my pipeline the novel isoform has been assigned to a gene with the name "novelGene_Gene1". If Gene1 begins with one of the allele suffix letters (L or R), then everything after "novelGene" is removed.

The proposed fix changes how fishpond identifies allele suffixes by:
1) splitting the gene name on '_'
2) deleting the last value of the resulting vector if it is one of the allele suffixes
3) joining the remaining vector on '_'

This method preserves underscores in the gene name.
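The three steps above can be sketched in Python as follows. This is an illustration of the approach only, not the R code in the PR; the function name `strip_allele_suffix` and its `suffixes` parameter are hypothetical.

```python
def strip_allele_suffix(gene, suffixes=("L", "R")):
    """Remove a trailing allele suffix by splitting on underscores.

    Illustrative sketch; names are hypothetical, not fishpond's API.
    """
    parts = gene.split("_")                  # 1) split on "_"
    if len(parts) > 1 and parts[-1] in suffixes:
        parts = parts[:-1]                   # 2) drop the last piece only if it is a suffix
    return "_".join(parts)                   # 3) rejoin on "_"

for name in ["WFIKKN1_RAB40C_L", "novelGene_LINC02291_AS_R", "GENE_X"]:
    print(name, "->", strip_allele_suffix(name))
```

Because only the final piece is ever inspected, an "_L" or "_R" earlier in the name is left alone, and a name whose last piece is not an allele suffix is returned unchanged.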

mikelove commented 2 years ago

Thanks for this example. I’ll take a look this week.

mikelove commented 2 years ago

I think I fixed this by using regular expressions in the existing code, e.g. _L$ rather than the previous _L. Let me know if this version fixes the bug on your end. Thanks for posting with examples, that was v helpful.

If all goes well this will be in the release next week.
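For readers following along, the effect of anchoring the pattern with `$` can be sketched in Python. This is an assumption-laden illustration, not the actual fishpond change (which is in R); `strip_suffix_anchored` is a hypothetical name.

```python
import re

def strip_suffix_anchored(gene, suffixes=("L", "R")):
    """Remove "_L" or "_R" only when it ends the string.

    Illustrative sketch of an end-anchored pattern such as `_L$`;
    the function name and signature are hypothetical.
    """
    pattern = "_(?:" + "|".join(re.escape(s) for s in suffixes) + ")$"
    return re.sub(pattern, "", gene)

for name in ["WFIKKN1_RAB40C_L", "novelGene_LINC02291_AS_R"]:
    print(name, "->", strip_suffix_anchored(name))
```

With the anchor, the "_R" in "WFIKKN1_RAB40C_L" and the "_L" in "novelGene_LINC02291_AS_R" no longer match, because neither sits at the end of the name.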

mikelove commented 2 years ago

I'm going to close this PR for now, just because I'm trying to get the GitHub Pages working again, and I think it's stalled on this branch. But I'm still interested if the above fix works for you!

thelovelab / fishpond

Allow importing of gene names with underscores #29