Closed joelnitta closed 3 years ago
Hi @joelnitta,
First of all, thank you very much for using FEELnc and also for the detailed report of the issue.
It seems the reference annotation contains some unusual transcripts and tbh, we are more used of the EnsEMBL files formats than Refseq.
One possible way to solve this issue would be to extract a priori mRNA
transcripts from the reference annotation (e.g those having a CDS).
awk '$3=="CDS"' $DATADIR/reference/daphnia_genome.gtf > $DATADIR/reference/daphnia_genome_CDS.gtf
And then used this file to train FEELnc for mRNA
sequences and the shuffle
mode for lncRNAs
:
FEELnc_codpot.pl \
-i $ANALYSISDIR/feelnc/candidate_lncRNA.gtf \
-a $DATADIR/reference/daphnia_genome_CDS.gtf \
-g $DATADIR/reference/reference.fasta \
--mode=shuffle
Note that you won't need the transcript_biotype=protein_coding
option anymore since 1) they are already extracted and 2) it is more for EnsEMBL gtf file formats.
HTH
Thomas
Thanks for the prompt response!
I tried your suggestion, but unfortunately it didn't work:
(/home/joelnitta/kato_daphnia/bin/feelnc) [joelnitta@juno feelnc]$ FEELnc_codpot.pl \
> -i $ANALYSISDIR/feelnc/candidate_lncRNA.gtf \
> -a $DATADIR/reference/daphnia_genome_CDS.gtf \
> -g $DATADIR/reference/reference.fasta \
> --mode=shuffle
Possible precedence issue with control flow operator at /home/joelnitta/kato_daphnia/bin/feelnc/lib/site_perl/5.26.2/Bio/DB/IndexedBase.pm line 805.
Warning: Output directory './feelnc_codpot_out' already exists... files might be overwritten!
You do not have specified a maximum number mRNAs transcripts for the training. Use all the annotation, can be long...
You do not have specified a maximum number lncRNA transcripts for the training. Use all the annotation, can be long...
> Extract ORFs/cDNAs for mRNAs from a GTF file
Parsing file 'daphnia_genome_CDS.gtf'...
Parse input file: [----------------------------------------------------------------------------------------------------]
Your input GTF file 'daphnia_genome_CDS.gtf' contains *23571* transcripts
Extracting ORFs/cDNAs 23571/23571...
Extracted '23571' ORF/cDNAs sequences on '23571'.
--------------------- WARNING ---------------------
MSG: Got a sequence without letters. Could not guess alphabet
---------------------------------------------------
This time though, the "MSG: Got a sequence without letters. Could not guess alphabet" was repeated many more times: 23571 times actually, the same as the number of transcripts.
At the end, I got a similar error as before:
> The lncRNA training file is not set. Get ORFs/cDNAs for lncRNAs by shuffling mRNA sequences
Input error: Invalid input file, expecting nucleotide sequence line on line 2
Failed to run fasta_ushuffle:
/home/joelnitta/kato_daphnia/bin/feelnc/bin/fasta_ushuffle -k 7 -s 1234 -n 3 < /tmp//200741_candidate_lncRNA.gtf.oneLine.fa > /tmp//200741_candidate_lncRNA.gtf.ushuffle.3perm.fa
(/home/joelnitta/kato_daphnia/bin/feelnc) [joelnitta@juno feelnc]$ head $DATADIR/reference/daphnia_genome_CDS.gtf
I would like to use EnsEMBL, but unfortunately it does not include my study species.
Ok, thanks for trying @joelnitta.
I rapidly tested with a toy candidate_lncRNA.gtf
file and with these 2 reference files from RefSeq :
GCF_003990815.1_ASM399081v1_genomic.fna
GCF_003990815.1_ASM399081v1_genomic.gtf
without any issue....
Then, either the reference files are not the same at one step of the pipeline, or there is smthg wrong with the candidate_lncRNA.gtf
file. If you don't mind sharing it, I could give it a try.
Cheers,
Hi @joelnitta, Thanks for sharing the files. I manage to find the issue. In your reference GTF, the one with the mRNAs, etc, you have several transcripts that have the same name: "unknown_transcript_1" and there are from different genes. Here just the two first occurrences):
> awk '$3=="CDS" && $0~/unknown_transcript_1/' daphnia_genome.gtf | hd -n2
NC_026914.1 RefSeq CDS 135 1118 . + 0 gene_id "YC34_gp13"; transcript_id "unknown_transcript_1"; gbkey "CDS"; gene "ND2"; locus_tag "YC34_gp13"; product "NADH dehydrogenase subunit 2"; protein_id "YP_009133114.1"; transl_table "5";
NC_026914.1 RefSeq CDS 1313 2846 . + 0 gene_id "YC34_gp12"; transcript_id "unknown_transcript_1"; gbkey "CDS"; gene "COX1"; locus_tag "YC34_gp12"; note "start codon not determined; TAA stop codon is completed by the addition of 3' A residues to the mRNA"; partial "true"; product "cytochrome c oxidase subunit I"; protein_id "YP_009133115.1"; transl_except "(pos:2849..2849,aa:TERM)"; transl_table "5";
...
By removing the transcripts, the program doesn't crash:
bash redobug_noUnknown.sh
Possible precedence issue with control flow operator at /home/valentin/programme/feelnc_conda/lib/site_perl/5.26.2/Bio/DB/IndexedBase.pm line 805.
You do not have specified a maximum number mRNAs transcripts for the training. Use all the annotation, can be long...
You do not have specified a maximum number lncRNA transcripts for the training. Use all the annotation, can be long...
> Extract ORFs/cDNAs for mRNAs from a GTF file
Parsing file 'daphnia_genome_noUnknown.gtf'...
Parse input file: [----------------------------------------------------------------------------------------------------]
Your input GTF file 'daphnia_genome_noUnknown.gtf' contains *35911* transcripts
Extracting ORFs/cDNAs 35788/35911...
Extracted '35788' ORF/cDNAs sequences on '35911'.
> The lncRNA training file is not set. Get ORFs/cDNAs for lncRNAs by shuffling mRNA sequences
Extracting ORFs/cDNAs 105535/107364...
Extracted '105535' ORF/cDNAs sequences on '107364'.
...
I didn't wait for it to finish (I do it locally so it is a bit consuming). But you can try by removing the "unknown_transcript_1".
So we have either an issue when parsing the data or one of the "unknown_transcript_1" sequences has a problem and this is why the sequence is empty. Anyway, I think that maybe changing the names of the transcripts could be a good idea, to avoid erasing the sequences during the parsing and the extraction of the FASTA files.
Tell us if it works for you or if you have other issues. Best, Valentin
Thanks so much for looking into the GTF files. I created a new GTF file with all "unknown_transcript_1" rows removed, and it worked!
Can you explain a bit more what you mean by "changing the names of the transcripts" though? For example, to change "unknown_transcript_1" so that each instance points to a unique gene?
Hi, I am glad that FEElnc is now working! And for the name, yes I am not sure why an empty sequence was generated. It is possible that one of the "unknown_transcript_1" generated this empty sequence, or it is possible that because transcripts had the same name generated this error. So yes, the idea was, if you want to keep these transcripts to change their names to have a unique name for each transcript. And if this is not working, then one of these "unknown_transcript_1" generates the error I would say.
Tell us if you need more help! Valentin
Hi Valentin,
I tried changing the GTF so that each "unknown_transcript" had a unique ID corresponding to the gene ID. For example,
root@45b379115647:/data/reference# cat daphnia_genome_unknown_fixed.gtf | tail
NC_026914.1 RefSeq CDS 9690 10190 . + 0 gene_id "YC34_gp03"; transcript_id "unknown_transcript_10"; gbkey "CDS"; gene "ND6"; locus_tag "YC34_gp03"; product "NADH dehydrogenase subunit 6"; protein_id "YP_009133124.1"; transl_table "5";
NC_026914.1 RefSeq start_codon 9690 9692 . + 0 gene_id "YC34_gp03"; transcript_id "unknown_transcript_10"; gbkey "CDS"; gene "ND6"; locus_tag "YC34_gp03"; product "NADH dehydrogenase subunit 6"; protein_id "YP_009133124.1"; transl_table "5";
NC_026914.1 RefSeq stop_codon 10191 10193 . + 0 gene_id "YC34_gp03"; transcript_id "unknown_transcript_10"; gbkey "CDS"; gene "ND6"; locus_tag "YC34_gp03"; product "NADH dehydrogenase subunit 6"; protein_id "YP_009133124.1"; transl_table "5";
NC_026914.1 RefSeq CDS 10193 10514 . + 0 gene_id "YC34_gp02"; transcript_id "unknown_transcript_11"; gbkey "CDS"; gene "CYTB"; locus_tag "YC34_gp02"; product "cytochrome b"; protein_id "YP_009133125.1"; transl_table "5";
NC_026914.1 RefSeq CDS 10514 11322 . + 0 gene_id "YC34_gp02"; transcript_id "unknown_transcript_11"; gbkey "CDS"; gene "CYTB"; locus_tag "YC34_gp02"; product "cytochrome b"; protein_id "YP_009133125.1"; transl_table "5";
NC_026914.1 RefSeq start_codon 10193 10195 . + 0 gene_id "YC34_gp02"; transcript_id "unknown_transcript_11"; gbkey "CDS"; gene "CYTB"; locus_tag "YC34_gp02"; product "cytochrome b"; protein_id "YP_009133125.1"; transl_table "5";
NC_026914.1 RefSeq stop_codon 11323 11325 . + 0 gene_id "YC34_gp02"; transcript_id "unknown_transcript_11"; gbkey "CDS"; gene "CYTB"; locus_tag "YC34_gp02"; product "cytochrome b"; protein_id "YP_009133125.1"; transl_table "5";
(here, gene_id "YC34_gp03"
corresponds to transcript_id "unknown_transcript_10"
and gene_id "YC34_gp02"
to transcript_id "unknown_transcript_11"
)
However, I still get the
--------------------- WARNING ---------------------
MSG: Got a sequence without letters. Could not guess alphabet
error (actually, it showed up 13 times).
So it seems there is something more to it than just the uniqueness of the transcript name. Anyways, it works if I remove all of the lines with unknown_transcript
, and they are only a very small part of the GTF file, so I will consider this issue solved. Please feel free to open it again if somebody runs into something similar and it might be helpful to try and fix this some other way.
Hi Joel,
Thanks for the update and good to know that removing these sequences is not a big issue for you. Even if the fact that these transcripts have the same name can be an issue, what I suspect is that one of these sequences is "messed up", in the sense that no sequence is extracted from it. If you want to keep them, maybe one thing to do will be to extract the coordinates of these sequences and check directly on the FASTA of the reference genome if all of them mapped to an existing genomic sequence.
Don't hesitate if you have any other issues or if you need help for anything else. Have a nice day! Valentin
Hi, thanks for this software!
I am getting an error with
FEELnc_codpot.pl
, as follows:As you can see, it does not extract all the transcripts successfully.
I tried to find what was causing the error by saving the intermediate output with
--keeptmp
. I created a slightly smaller GTF file that only contained the 35789 transcripts that were successfully extracted and tried it on that, but still got the "sequence without letters" and "Input error: Invalid input file" errors.I checked
144030_candidate_lncRNA.gtf.oneLine.fa
, and sure enough there is an entry in the fasta at lines 14909 and 14910 like this:There are the same
>unknown_transcript_1
followed by an empty line in85227_candidate_lncRNA.gtf.coding_orf.fa
and85227_candidate_lncRNA.gtf.coding_rna.fa
. However, I cannot determine where these are coming from.My reference fasta and GTF files are from here: https://www.ncbi.nlm.nih.gov/assembly/GCF_003990815.1/
My input GTF file is the result of a isoseq pipeline. The most recent step that generated the GTF was SQANTI3, using of course the same reference genome for mapping.
I am using the conda environment described in the README (thanks for that! so nice not to have to install a bunch of things just to try out the pipeline).
I would really appreciate any suggestions! I can also send the data via email if that would help.
Best,
Joel