tderrien / FEELnc

FEELnc : FlExible Extraction of LncRNA
GNU General Public License v3.0
82 stars 28 forks source link

Error running feeling_codpot.pl #49

Closed Bart-Joosten closed 2 years ago

Bart-Joosten commented 3 years ago

Hi,

I'm trying to find lncRNA using Feelnc. My pipeline consists of the following: Trimmomatic --> HISAT2 --> stringtie and stringtie assembly --> feelnc filter using the following input: FEELnc_filter.pl -i new_merged.combined.gtf -a gencode.v38.annotation.gtf > new_candidate_model.gtf. This command was executed without any issues. However, upon trying the following command: FEELnc_codpot.pl -i new_candidate_model.gtf -a gencode.v38.annotation.gtf -b transcript_biotype=protein_coding -l gencode.v38.long_noncoding_RNAs.gtf -g /Users/username/Downloads/gencode.v38.transcripts.fa —mode=shuffle, I got an error. The output from this last was:

Multiple replications of the following:

Parse input file: [------------------------------------------------

And multiple different versions of the following:

ExtractFromFeature::feature2seq: your seq chr5:131954211-131954371 returns an empty string!...Check 'chr' prefix between your annotation and genome files or remove your genome index file ('/Users/username/Downloads/gencode.v38.transcripts.fa.index')...

Where chromosome number (chr5, 6, 7, etc.) and location on the chromosome are differing.

The output ends with: The number of complete ORF found with computeORF mode is 0 transcripts... That's not enough to train the program

I have also found the following GitHub issue: https://github.com/tderrien/FEELnc/issues/3 - where it was suggested to remove the fasta file from the command. However, when I attempt this, feelnc outputs: Error: Cannot read your genome file '' (-g option)....

Also, it shows that all of the following is mandatory (and not optional as stated on the GitHub page): Usage: FEELnc_codpot.pl -i transcripts.GTF -a known_mRNA.GTF -g genome.FA -l known_lnc.GTF [options...]. Running feelnc_codpot.pl without all the options enabled throws an error

vwucher commented 3 years ago

Hi,

Thanks for all the explanation on the issues.

Concerning the first issue, did you tried to run the example we provided? Did it run without error? Also, did you check the notation of the chromosome between all the GTF and FASTA files you used? If in some file it is written "chr15" and "15" in the other, FEELnc cannot extract the sequence. And did you removed the index file and run it again as suggested by the error?

For the other issue about removing the FASTA file, it is normal that it doesn't work here. You are providing GTF files for mRNAs and lncRNAs, so FEELnc need to have the FASTA file of the genome (the sequence) to be able to extract the sequence of the RNAs. The sequence of the genome is no important if you are providing FASTA directly for your mRNAs and lncRNAs.

Also, from your command line: FEELnc_codpot.pl -i new_candidate_model.gtf -a gencode.v38.annotation.gtf -b transcript_biotype=protein_coding -l gencode.v38.long_noncoding_RNAs.gtf -g /Users/username/Downloads/gencode.v38.transcripts.fa —mode=shuffle

You try to use the shuffle mode "-mode=shuffle" while providing a file of lncRNA "-l gencode.v38.long_noncoding_RNAs.gtf". The shuffle mode is intended to be used when no lncRNA are provided so the "-mode" option is not needed if a file of lncRNA sequences is provided.

Please tell use how it works with the example files and if removing the index file make FEELnc_codpot.pl work. Have a nice day, Valentin

Bart-Joosten commented 3 years ago

Hi Valentin,

Thanks for your explanation. I tried the pipeline using a different fasta file and feelnc filter and codpot (without shuffle) ran without any errors. (Also, the example ran without any errors when I first used feelnc).

Now I am only having some problems with the feelnc classifier:

FEELnc_classifier.pl -i feelnc_codpot_out/new_candidate_model.gtf.lncRNA.gtf -a gencode.v38.annotation.gtf > candidate_lncRNA_classes.txt

For now I have just used the lncRNA file which has been outputted by feelnc_codpot without combining it with the noORF file. The error says:

window : 10000 - max window : 100000 - lncrna : feelnc_codpot_out/new_candidate_model.gtf.lncRNA.gtf - mrna : gencode.v38.annotation.gtf - biotype: 0 Importation merge GTF file : 4105 lncRNA done HASH: Out of overflow pages. Increase page size HASH: Out of overflow pages. Increase page size HASH: Out of overflow pages. Increase page size HASH: Out of overflow pages. Increase page size HASH: Out of overflow pages. Increase page size HASH: Out of overflow pages. Increase page size HASH: Out of overflow pages. Increase page size HASH: Out of overflow pages. Increase page size HASH: Out of overflow pages. Increase page size HASH: Out of overflow pages. Increase page size HASH: Out of overflow pages. Increase page size Importation merge GTF file : 237012 mRNA done ... Looking for interactions, currently eq to 1400 ... Looking for interactions, currently eq to 1600 ... Looking for interactions, currently eq to 2000 Can't call method "primary_tag" on an undefined value at /Users/hamza/perl5/perlbrew/perls/perl-5.18.4/lib/site_perl/5.18.4/darwin-2level/Bio/SeqFeature/LncRNAs_Factory.pm line 217.

I also wanted to ask you another question: I am trying to find diff. expressed lncRNAs between two groups but I am somewhat new to the area of RNAseq. For expression analysis using featurecounts or HTseq, I should provide the merged stringtie/cufflinks GTF file correct? However, for what process should I use the two new GTF files (.gtf.noORF.gtf and gtf.lncRNA.gtf) produced by feelnc_codpot?

Thanks again for your help!

vwucher commented 3 years ago

Hi,

I think one of the issue here is that you tried to classify your new lncRNA "feelnc_codpot_out/new_candidate_model.gtf.lncRNA.gtf" against all the element of the genome: "Importation merge GTF file : 237012 mRNA done" Maybe one think you can try first will be to filter your reference file "gencode.v38.annotation.gtf" by selecting only the "protein_coding" RNA. Can you try this and tell us if it works with the filtered file?

Concerning the differential expression analysis, the lncRNA that have been identified using FEELnc are already in your new GTF, i.e. the output of stringtie. What the tool does is "just" to classify, or add a tag, to each input sequence to know if they are coding or not. So normally, if you get the expression of your transcripts using HTseq, the new lncRNA should already be there. You just need to get the ID of the new lncRNAs in your "gtf.lncRNA.gtf" file and then analyse only them if you want to see differentially expressed new lncRNA or highlighting them.

Tell me if the classifier works with the filtered file and if I was clear in my explanation. Have a nice day, Valentin

Bart-Joosten commented 3 years ago

Hi Valentin,

I saw that I was using the entire human annotation GTF for all the steps instead of a GTF with the mRNAs only, so as you suggested, I isolated all the protein coding transcripts using: grep protein_coding gencode.v38.annotation.gtf > mrna_gtf.gtf. I checked that this gave me only protein coding transcripts and I did all the FEELnc modules again which went well without any errors. However, unfortunately I got a similar error at FEELnc_classifier as before:

bash-3.2$ FEELnc_classifier.pl -i feelnc_codpot_out/mrna_filter_candidate_model.gtf.lncRNA.gtf -a mrna_gtf.gtf > candidate_lncRNA_classes.txt window : 10000 - max window : 100000 - lncrna : feelnc_codpot_out/mrna_filter_candidate_model.gtf.lncRNA.gtf - mrna : mrna_gtf.gtf - biotype: 0 Importation merge GTF file : 50040 lncRNA done HASH: Out of overflow pages. Increase page size HASH: Out of overflow pages. Increase page size HASH: Out of overflow pages. Increase page size HASH: Out of overflow pages. Increase page size HASH: Out of overflow pages. Increase page size HASH: Out of overflow pages. Increase page size HASH: Out of overflow pages. Increase page size HASH: Out of overflow pages. Increase page size Importation merge GTF file : 162041 mRNA done ... Looking for interactions, currently eq to 900 ... Looking for interactions, currently eq to 1200 ... Looking for interactions, currently eq to 1400 ... Looking for interactions, currently eq to 2100 ... Looking for interactions, currently eq to 2700 ... Looking for interactions, currently eq to 2900 ... Looking for interactions, currently eq to 3400 ... Looking for interactions, currently eq to 3900 ... Looking for interactions, currently eq to 4100 ... Looking for interactions, currently eq to 5400 ... Looking for interactions, currently eq to 6100 Can't call method "primary_tag" on an undefined value at /Users/username/perl5/perlbrew/perls/perl-5.18.4/lib/site_perl/5.18.4/darwin-2level/Bio/SeqFeature/LncRNAs_Factory.pm line 217.

Concerning the differential expression analysis, I think I understand now, thanks. I have used HTseq using my BAM files and the entire human annotation GTF and produced gene counts. Using the Ensemble gene IDs from the gtf.lncRNA.gtf I have found the counts from several lncRNAs :). However, as I was using stringtie, I also got some MSTRG gene IDs. Do you know how to translate those to ensemble gene IDs? Or do you prefer to just use cufflinks and cuffmerge to only get ensemble IDs which are ready to use for HTseq to find gene counts from lncRNAs?

Lastly, I was interested on your opinion about something: do you think that there will be many missing lncRNAs if one would use poly-a selected RNA seq instead of rRNA removal?

Thanks!

vwucher commented 3 years ago

Hi Bart,

For the classifier error, I am wondering if the error doesn't come from the fact that you have a lot of transcripts... Before you had 4,105 lncRNAs and 237,012 transcripts from your reference and now 50,040 lncRNAs (a lot more) and 16,204 transcripts from your reference. Can you try to take like the first 10 lncRNAs and like 100 transcripts from your reference? Take them from the same chromosome if possible. Because this step is working on the example files right?

For converting stringtie ID to Ensembl ID, it will depend if the transcript exist or not in the Ensembl annotation. For this I will maybe let @tderrien answer, he will be more helpful. But I will say it is will be better to keep the maximal annotation you have and then get the counts from the new GTF file. And for the last question, there will be a difference between the two protocol but I cannot tell in which proportion... @tderrien you have an idea?

Tell us if the classifier works on the example file (which is the case if I remember correctly) and if it works on a subset of all your transcripts (both files). Bye! Val

tderrien commented 3 years ago

Hello,

Regarding the error message, it may indeed be related to the large number of transcripts given as input to the FEELnc_classifier.pl. This has also been reported by others using the bioperl Berkeley DB module. Maybe @flegeai can confirm? ;-)

I think MSTRG gene IDs are newly reconstructed genes from Stringtie and are thus (by definition) not present in the reference annotation (Ensembl). I don't really know HTSeq but if you want to have gene counts based on a reference .gtf (including or not new genes) and . bam files , there are many tools to do that (RSEM, Salmon/Kallisto, ...)

For your last question, some papers have compared polyA versus non-polyA lncRNA annotation (e.g. Djebali et al, 2012 ) but it has to be noted that (lnc)RNAs can also exist as bimorphic (both poly and non-polyA) depending on the stage/conditions they were sampled.

Hope this helps.

Best,

Thomas

Bart-Joosten commented 3 years ago

Hi Valentin and Thomas,

Indeed, the example steps were working well for me. As you suggested, I have used 12 lncRNAs from the gtf.lncrna.gtf and 100 transcripts from the mrna_gtf.gtf both from chr1 and ran the feelnc_classifier module: FEELnc_classifier.pl -i subset_lncrnas.gtf -a subset_annotation_file.gtf > candidate_lncRNA_classes.txt --> output window : 10000 - max window : 100000 - lncrna : subset_lncrnas.gtf - mrna : subset_annotation_file.gtf - biotype: 0 Importation merge GTF file : 2 lncRNA done Importation merge GTF file : 6 mRNA done

Log output was as follows:

FEELnc Classification

lncRNA file : lncrna : subset_lncrnas.gtf

mRNA file : subset_annotation_file.gtf

Minimal window size : 10000

Maximal window size : 100000

Number of lncRNA : 2

Number of mRNA : 6

Number of interaction : 0

Number of lncRNA without interaction : 2

List of lncRNA without interaction : ENST00000664672.1 ENST00000625722.2

candidate_lncRNA_classes.txt: isBest lncRNA_gene lncRNA_transcript partnerRNA_gene partnerRNA_transcript direction type distance subtype location

Do you have an idea what the issue could be? Do you think its a memory issue or something else?

Thanks for your suggestions on my other questions!

vwucher commented 3 years ago

Hi Bart,

From what I see from your log, it looks like the classifier works fine. So I think that the issue is the number of transcripts you have in your other runs as @tderrien pointed in his answer. One way to solve it will be to divide your file of lncRNAs into several files and to run the classifier on each one of them. But the fact that you have all these lncRNAs is also a bit weird. YOu work on human right? So it is strange that you have so much lncRNA transcripts that are supposed to be new. Maybe double check the different step and if the lncRNAs you have at the start of the classifier step is the good one and if it is, then split the file into several ones and then run the classifier on each one.

Tell us if you manage to get what you want! Valentin

tderrien commented 3 years ago

Hi,

I think it worked fine as indicated in the output of FEELnc for the 2 lncRNAs (not 10 as you mentioned) :

#List of lncRNA without interaction : ENST00000664672.1 ENST00000625722.2

If no interaction are found, it could be due to the fact that the lncRNAs randomly (?) chosen are not localized close enough (i.e. within the considered window -w) to the randomly selected (?) mRNAs. You could either select the entire chr1 for both files (grep "^chr1" {biotype}.gtf > {biotype}_chr1.gtf) not to bias the selection of lncRNA/mRNAs and/or increase the -w, --window parameters to search for more distant lncRNA/mRNA interactions.

If the first strategy works, I'd generalize it to all chromosomes given the high number of transcripts you have.

HTH

Bart-Joosten commented 3 years ago

Hi Valentin and Thomas,

I have tried your suggestion of splitting lncRNAs and mRNAs from GTFs based on chromosome number and it is working as expected!

Also, Valentin wrote: "So it is strange that you have so much lncRNA transcripts that are supposed to be new." I checked my feelnc commands and other commands for a mistake that could be the cause of all the lncRNAs that have been found. However i could not find one. Do you think it is possible/likely that i have made a mistake somewhere and any idea where it could be? And how do you know that many new lncRNA transcripts have been found?

Thanks a lot!

vwucher commented 3 years ago

Hi Bart,

That's cool that now everything work fine! Concerning my comment, I don't think that you did a mistake. But it is just that the number surprised me a little bit because you work on human right? When you performed the 1rst step, the filter, in the reference you included only the protein coding genes? If yes, then it is probably why you have a lot of lncRNAs in your 2nd run (in the first it was ~4k), you probably have also the previously identified lncRNAs from GENCODE.

Bye, Valentin

bart1920 commented 3 years ago

Hi Valentin,

Sorry to bother you again, I was finishing another project. You are correct, in the filter I included only the protein coding genes, however, when I count the number of lncRNA genes with MSTRG IDs in the gtf.lncrna.gtf from the codpot output, I get >160k hits (and 170k hits with ENSG IDs). This means there are >160k new lncRNA genes right? Which seems too high when I compare it to other published articles? Do you have an idea how this is possible?

Kind regards

tderrien commented 2 years ago

Hi Bart,

This means there are >160k new lncRNA genes right?

Not exactly. It probably corresponds to the number of lines/exons not genes.

Which seems too high when I compare it to other published articles? Do you have an idea how this is possible?

It's hard to say without knowing the number and types (normal/cancer) of RNAseq samples you used.

Best,