tderrien / FEELnc

FEELnc : FlExible Extraction of LncRNA
GNU General Public License v3.0
79 stars 28 forks source link

Problems with FEELnc_codpot.pl (unknown_transcript_1) #42

Closed joelnitta closed 3 years ago

joelnitta commented 3 years ago

Hi, thanks for this software!

I am getting an error with FEELnc_codpot.pl, as follows:

FEELnc_codpot.pl \
>   -i $ANALYSISDIR/feelnc/candidate_lncRNA.gtf \
>   -a $DATADIR/reference/daphnia_genome.gtf \
>   -b transcript_biotype=protein_coding \
>   -g $DATADIR/reference/reference.fasta \
>   --mode=shuffle
Possible precedence issue with control flow operator at /home/joelnitta/kato_daphnia/bin/feelnc/lib/site_perl/5.26.2/Bio/DB/IndexedBase.pm line 805.
You do not have specified a maximum number mRNAs transcripts for the training. Use all the annotation, can be long...
You do not have specified a maximum number lncRNA transcripts for the training. Use all the annotation, can be long...
> Extract ORFs/cDNAs for mRNAs from a GTF file
Parsing file 'daphnia_genome.gtf'...
Parse input file:             [----------------------------------------------------------------------------------------------------]
    Your input GTF file 'daphnia_genome.gtf' contains *35912* transcripts
    Extracting ORFs/cDNAs 35789/35912...
    Extracted '35789' ORF/cDNAs sequences on '35912'.

--------------------- WARNING ---------------------
MSG: Got a sequence without letters. Could not guess alphabet
---------------------------------------------------
> The lncRNA training file is not set. Get ORFs/cDNAs for lncRNAs by shuffling mRNA sequences
Input error: Invalid input file, expecting nucleotide sequence line on line 14910

Failed to run fasta_ushuffle:
/home/joelnitta/kato_daphnia/bin/feelnc/bin/fasta_ushuffle -k 7 -s 1234 -n 3 < /tmp//144030_candidate_lncRNA.gtf.oneLine.fa > /tmp//144030_candidate_lncRNA.gtf.ushuffle.3perm.fa

As you can see, it does not extract all the transcripts successfully.

I tried to find what was causing the error by saving the intermediate output with --keeptmp. I created a slightly smaller GTF file that only contained the 35789 transcripts that were successfully extracted and tried it on that, but still got the "sequence without letters" and "Input error: Invalid input file" errors.

I checked 144030_candidate_lncRNA.gtf.oneLine.fa, and sure enough there is an entry in the fasta at lines 14909 and 14910 like this:

>unknown_transcript_1

There are the same >unknown_transcript_1 followed by an empty line in 85227_candidate_lncRNA.gtf.coding_orf.fa and 85227_candidate_lncRNA.gtf.coding_rna.fa. However, I cannot determine where these are coming from.

My reference fasta and GTF files are from here: https://www.ncbi.nlm.nih.gov/assembly/GCF_003990815.1/

My input GTF file is the result of a isoseq pipeline. The most recent step that generated the GTF was SQANTI3, using of course the same reference genome for mapping.

I am using the conda environment described in the README (thanks for that! so nice not to have to install a bunch of things just to try out the pipeline).

I would really appreciate any suggestions! I can also send the data via email if that would help.

Best,

Joel

tderrien commented 3 years ago

Hi @joelnitta,

First of all, thank you very much for using FEELnc and also for the detailed report of the issue.

It seems the reference annotation contains some unusual transcripts and tbh, we are more used of the EnsEMBL files formats than Refseq.

One possible way to solve this issue would be to extract a priori mRNA transcripts from the reference annotation (e.g those having a CDS).

awk '$3=="CDS"' $DATADIR/reference/daphnia_genome.gtf > $DATADIR/reference/daphnia_genome_CDS.gtf 

And then used this file to train FEELnc for mRNA sequences and the shuffle mode for lncRNAs :

FEELnc_codpot.pl \
   -i $ANALYSISDIR/feelnc/candidate_lncRNA.gtf \
   -a  $DATADIR/reference/daphnia_genome_CDS.gtf \
   -g $DATADIR/reference/reference.fasta \
   --mode=shuffle

Note that you won't need the transcript_biotype=protein_coding option anymore since 1) they are already extracted and 2) it is more for EnsEMBL gtf file formats.

HTH

Thomas

joelnitta commented 3 years ago

Thanks for the prompt response!

I tried your suggestion, but unfortunately it didn't work:

(/home/joelnitta/kato_daphnia/bin/feelnc) [joelnitta@juno feelnc]$ FEELnc_codpot.pl \
>    -i $ANALYSISDIR/feelnc/candidate_lncRNA.gtf \
>    -a  $DATADIR/reference/daphnia_genome_CDS.gtf \
>    -g $DATADIR/reference/reference.fasta \
>    --mode=shuffle
Possible precedence issue with control flow operator at /home/joelnitta/kato_daphnia/bin/feelnc/lib/site_perl/5.26.2/Bio/DB/IndexedBase.pm line 805.
Warning: Output directory './feelnc_codpot_out' already exists... files might be overwritten!
You do not have specified a maximum number mRNAs transcripts for the training. Use all the annotation, can be long...
You do not have specified a maximum number lncRNA transcripts for the training. Use all the annotation, can be long...
> Extract ORFs/cDNAs for mRNAs from a GTF file
Parsing file 'daphnia_genome_CDS.gtf'...
Parse input file:             [----------------------------------------------------------------------------------------------------]
    Your input GTF file 'daphnia_genome_CDS.gtf' contains *23571* transcripts
    Extracting ORFs/cDNAs 23571/23571...
    Extracted '23571' ORF/cDNAs sequences on '23571'.

--------------------- WARNING ---------------------
MSG: Got a sequence without letters. Could not guess alphabet
---------------------------------------------------

This time though, the "MSG: Got a sequence without letters. Could not guess alphabet" was repeated many more times: 23571 times actually, the same as the number of transcripts.

At the end, I got a similar error as before:

> The lncRNA training file is not set. Get ORFs/cDNAs for lncRNAs by shuffling mRNA sequences
Input error: Invalid input file, expecting nucleotide sequence line on line 2

Failed to run fasta_ushuffle:
/home/joelnitta/kato_daphnia/bin/feelnc/bin/fasta_ushuffle -k 7 -s 1234 -n 3 < /tmp//200741_candidate_lncRNA.gtf.oneLine.fa > /tmp//200741_candidate_lncRNA.gtf.ushuffle.3perm.fa
(/home/joelnitta/kato_daphnia/bin/feelnc) [joelnitta@juno feelnc]$ head $DATADIR/reference/daphnia_genome_CDS.gtf 

I would like to use EnsEMBL, but unfortunately it does not include my study species.

tderrien commented 3 years ago

Ok, thanks for trying @joelnitta.

I rapidly tested with a toy candidate_lncRNA.gtf file and with these 2 reference files from RefSeq :

without any issue....

Then, either the reference files are not the same at one step of the pipeline, or there is smthg wrong with the candidate_lncRNA.gtf file. If you don't mind sharing it, I could give it a try.

Cheers,

vwucher commented 3 years ago

Hi @joelnitta, Thanks for sharing the files. I manage to find the issue. In your reference GTF, the one with the mRNAs, etc, you have several transcripts that have the same name: "unknown_transcript_1" and there are from different genes. Here just the two first occurrences):

> awk '$3=="CDS" && $0~/unknown_transcript_1/' daphnia_genome.gtf | hd -n2
NC_026914.1 RefSeq  CDS 135 1118    .   +   0   gene_id "YC34_gp13"; transcript_id "unknown_transcript_1"; gbkey "CDS"; gene "ND2"; locus_tag "YC34_gp13"; product "NADH dehydrogenase subunit 2"; protein_id "YP_009133114.1"; transl_table "5"; 
NC_026914.1 RefSeq  CDS 1313    2846    .   +   0   gene_id "YC34_gp12"; transcript_id "unknown_transcript_1"; gbkey "CDS"; gene "COX1"; locus_tag "YC34_gp12"; note "start codon not determined; TAA stop codon is completed by the addition of 3' A residues to the mRNA"; partial "true"; product "cytochrome c oxidase subunit I"; protein_id "YP_009133115.1"; transl_except "(pos:2849..2849,aa:TERM)"; transl_table "5"; 
...

By removing the transcripts, the program doesn't crash:

bash redobug_noUnknown.sh
Possible precedence issue with control flow operator at /home/valentin/programme/feelnc_conda/lib/site_perl/5.26.2/Bio/DB/IndexedBase.pm line 805.
You do not have specified a maximum number mRNAs transcripts for the training. Use all the annotation, can be long...
You do not have specified a maximum number lncRNA transcripts for the training. Use all the annotation, can be long...
> Extract ORFs/cDNAs for mRNAs from a GTF file
Parsing file 'daphnia_genome_noUnknown.gtf'...
Parse input file:             [----------------------------------------------------------------------------------------------------]
    Your input GTF file 'daphnia_genome_noUnknown.gtf' contains *35911* transcripts
    Extracting ORFs/cDNAs 35788/35911...
    Extracted '35788' ORF/cDNAs sequences on '35911'.
> The lncRNA training file is not set. Get ORFs/cDNAs for lncRNAs by shuffling mRNA sequences
    Extracting ORFs/cDNAs 105535/107364...
    Extracted '105535' ORF/cDNAs sequences on '107364'.
...

I didn't wait for it to finish (I do it locally so it is a bit consuming). But you can try by removing the "unknown_transcript_1".

So we have either an issue when parsing the data or one of the "unknown_transcript_1" sequences has a problem and this is why the sequence is empty. Anyway, I think that maybe changing the names of the transcripts could be a good idea, to avoid erasing the sequences during the parsing and the extraction of the FASTA files.

Tell us if it works for you or if you have other issues. Best, Valentin

joelnitta commented 3 years ago

Thanks so much for looking into the GTF files. I created a new GTF file with all "unknown_transcript_1" rows removed, and it worked!

Can you explain a bit more what you mean by "changing the names of the transcripts" though? For example, to change "unknown_transcript_1" so that each instance points to a unique gene?

vwucher commented 3 years ago

Hi, I am glad that FEElnc is now working! And for the name, yes I am not sure why an empty sequence was generated. It is possible that one of the "unknown_transcript_1" generated this empty sequence, or it is possible that because transcripts had the same name generated this error. So yes, the idea was, if you want to keep these transcripts to change their names to have a unique name for each transcript. And if this is not working, then one of these "unknown_transcript_1" generates the error I would say.

Tell us if you need more help! Valentin

joelnitta commented 3 years ago

Hi Valentin,

I tried changing the GTF so that each "unknown_transcript" had a unique ID corresponding to the gene ID. For example,

root@45b379115647:/data/reference# cat daphnia_genome_unknown_fixed.gtf | tail
NC_026914.1 RefSeq  CDS 9690    10190   .   +   0   gene_id "YC34_gp03"; transcript_id "unknown_transcript_10"; gbkey "CDS"; gene "ND6"; locus_tag "YC34_gp03"; product "NADH dehydrogenase subunit 6"; protein_id "YP_009133124.1"; transl_table "5";
NC_026914.1 RefSeq  start_codon 9690    9692    .   +   0   gene_id "YC34_gp03"; transcript_id "unknown_transcript_10"; gbkey "CDS"; gene "ND6"; locus_tag "YC34_gp03"; product "NADH dehydrogenase subunit 6"; protein_id "YP_009133124.1"; transl_table "5";
NC_026914.1 RefSeq  stop_codon  10191   10193   .   +   0   gene_id "YC34_gp03"; transcript_id "unknown_transcript_10"; gbkey "CDS"; gene "ND6"; locus_tag "YC34_gp03"; product "NADH dehydrogenase subunit 6"; protein_id "YP_009133124.1"; transl_table "5";
NC_026914.1 RefSeq  CDS 10193   10514   .   +   0   gene_id "YC34_gp02"; transcript_id "unknown_transcript_11"; gbkey "CDS"; gene "CYTB"; locus_tag "YC34_gp02"; product "cytochrome b"; protein_id "YP_009133125.1"; transl_table "5";
NC_026914.1 RefSeq  CDS 10514   11322   .   +   0   gene_id "YC34_gp02"; transcript_id "unknown_transcript_11"; gbkey "CDS"; gene "CYTB"; locus_tag "YC34_gp02"; product "cytochrome b"; protein_id "YP_009133125.1"; transl_table "5";
NC_026914.1 RefSeq  start_codon 10193   10195   .   +   0   gene_id "YC34_gp02"; transcript_id "unknown_transcript_11"; gbkey "CDS"; gene "CYTB"; locus_tag "YC34_gp02"; product "cytochrome b"; protein_id "YP_009133125.1"; transl_table "5";
NC_026914.1 RefSeq  stop_codon  11323   11325   .   +   0   gene_id "YC34_gp02"; transcript_id "unknown_transcript_11"; gbkey "CDS"; gene "CYTB"; locus_tag "YC34_gp02"; product "cytochrome b"; protein_id "YP_009133125.1"; transl_table "5";

(here, gene_id "YC34_gp03" corresponds to transcript_id "unknown_transcript_10" and gene_id "YC34_gp02" to transcript_id "unknown_transcript_11")

However, I still get the

--------------------- WARNING ---------------------
MSG: Got a sequence without letters. Could not guess alphabet

error (actually, it showed up 13 times).

So it seems there is something more to it than just the uniqueness of the transcript name. Anyways, it works if I remove all of the lines with unknown_transcript, and they are only a very small part of the GTF file, so I will consider this issue solved. Please feel free to open it again if somebody runs into something similar and it might be helpful to try and fix this some other way.

vwucher commented 3 years ago

Hi Joel,

Thanks for the update and good to know that removing these sequences is not a big issue for you. Even if the fact that these transcripts have the same name can be an issue, what I suspect is that one of these sequences is "messed up", in the sense that no sequence is extracted from it. If you want to keep them, maybe one thing to do will be to extract the coordinates of these sequences and check directly on the FASTA of the reference genome if all of them mapped to an existing genomic sequence.

Don't hesitate if you have any other issues or if you need help for anything else. Have a nice day! Valentin