nunofonseca / irap

integrated RNA-seq Analysis Pipeline
GNU General Public License v3.0
82 stars 33 forks source link

irap_raw2metric no html file, not gzipped #50

Closed freekvh closed 6 years ago

freekvh commented 6 years ago

Hi Nuno,

First of all, I'm on 0.8.5p2 so I'm sorry of you already solved this.

I noticed (renormalizing data on protein coding genes only), that the help of irap_raw2metric specifies:

-o OUT, --out=OUT
    Output file name prefix. An HTML and TSV.gz file will be created.

But I don't see any html being generated and the resulting file is also not gzipped.

Highest regards,

Freek.

freekvh commented 6 years ago

On second though, perhaps this is because I'm not specifying a GTF file, now that I did, the process takes much much longer (still waiting for it to finish). Is it now regenerating the lengths?

Another question: Are you defining the gene lengths (or the exon lengths?) as the combined length of all exons? So a projection of all exons on one axis and then computing the combined lengths? Meaning that exotic transcripts that may hardly occur may still add to gene/combined exon lengths?

And, related, one last question: irap_raw2metric takes as input "genes.raw.rsem.tsv" (if rsem was used), so if I'm not mistaken, that means that it only uses the gene,exon,transcript option to determine what lengths to use. In a normal RNAseq case you'd say you don't want intronic reads to count so leave them out and yet, in the logs I find that "--feature gene" is used by irap_raw2metric. Shouldn't that be "--feature exon"? Because I assume RSEM will also only use exonic reads.

Highest regards.

nunofonseca commented 6 years ago

Hi Freek, The help message was a bit out of date - I just pushed a commit fixing that. irap_raw2metric does not generate a tsv.gz or html file. It may save the quantification in tsv format or a matrix market format.

Yes, if you provide the GTF it will calculate the lengths.

By the default the gene length is computed as the sum of the length of all collapsed exons of the gene.

exon1 exon2 =-----= =-----= exon3 =-----=

Collapsed =-------= =-----=

Length: =-------==-----=

Regarding you last point, the --feature option is to indicate which feature was quantified (gene, exon or transcript). If you provide a file with gene quantification you will want to pass --feature gene. Hopefully it is clear from the explanation above that the gene length does not include the introns. Please let me know if this is not clear. Cheers.

freekvh commented 6 years ago

Hi Nuno,

Thank you, this is very helpful indeed.

I do have yet another question then.

In the --help output there is only the option to --exclude biotypes from the normalization process. So I made a list of all biotypes except protein_coding and supplied it using the --exclude option. However this does not seem to alter the results, I keep getting back a TSV with all genes TPM normalized as is done in the default pipeline.

I found an option --mass_biotypes in the script itself (not part of the help), I also tried to use that to normalize on protein coding genes only but I get the error: long flag "mass_biotypes" is invalid

Perhaps it is easier if I describe what I want. What I want is to normalize on protein_coding genes only (this makes for more stable result in the face of more or less successful rRNA removal for example). I did this by filtering the raw output of rsem (genes.raw.rsem.tsv) for only protein coding genes and then running:

irap_raw2metric --tsv genes.raw.rsem.tsv --lengths Homo_sapiens.GRCh38.87.gtf.lengths.Rdata --feature gene --metric tpm --out default_iRAP.tsv

(By the way I supply irap_raw2metric with biotype names as they are defined in the standard ensemble GTF file, i.e. protein_coding, lincRNA, rRNA, etc)

That works. However my colleagues and I are also interest in at least 1 non-protein coding gene. Which is why I'm playing with irap_raw2metric. I was hoping that I could normalize on protein coding genes but still get TPM values for non-protein_coding genes. I guess the then it is not really called TPM anymore because the sum of the values would be above 1 million? So if you have any thoughts on this they are very much appreciated as well. A possible solution would be include the biotype of the non-protein coding gene of interest perhaps, but this would mean that I start changing values if I add more biotypes of interest later on.

I still wonder how would a working command for TPM normalization look like on only protein_coding genes for irap_raw2metric? The following does not seem to work:

source /home/genomics/testsw/irap-0.8.5p2/irap_setup.sh
irap_raw2metric --tsv genes.raw.rsem.tsv --lengths Homo_sapiens.GRCh38.87.gtf.lengths.Rdata --feature gene --metric tpm --out exclude_iRAP.tsv --exclude IG_pseudogene,Mt_tRNA,TR_D_gene,IG_C_pseudogene,Mt_rRNA,TR_J_gene,unitary_pseudogene,miRNA,TR_V_pseudogene,sRNA,bidirectional_promoter_lncRNA,pseudogene,IG_V_gene,antisense,rRNA,snRNA,IG_V_pseudogene,transcribed_processed_pseudogene,TR_J_pseudogene,misc_RNA,processed_pseudogene,IG_J_gene,snoRNA,IG_D_gene,sense_overlapping,IG_J_pseudogene,sense_intronic,scRNA,polymorphic_pseudogene,processed_transcript,unprocessed_pseudogene,TEC,ribozyme,vaultRNA,macro_lncRNA,3prime_overlapping_ncRNA,transcribed_unitary_pseudogene,IG_C_gene,TR_C_gene,TR_V_gene,transcribed_unprocessed_pseudogene,scaRNA,lincRNA,non_coding

I hope you don't mind all these questions, I very much appreciate your work on iRAP and your time.

Highest regards,

Freek.

nunofonseca commented 6 years ago

Hi, I'll try to cover all the questions.

In the --help output there is only the option to --exclude biotypes from the normalization process. So I >made a list of all biotypes except protein_coding and supplied it using the --exclude option. However this >does not seem to alter the results, I keep getting back a TSV with all genes TPM normalized as is done >in the default pipeline.

This option (--exclude) is deprecated.

I found an option --mass_biotypes in the script itself (not part of the help), I also tried to use that to >normalize on protein coding genes only but I get the error: long flag "mass_biotypes" is invalid

Please try the irap_raw2metric in the new version (1.0.0a). The help should be informative and it should work.

(By the way I supply irap_raw2metric with biotype names as they are defined in the standard ensemble >GTF file, i.e. protein_coding, lincRNA, rRNA, etc)

Those are the only biotypes that iRAP knows about.

That works. However my colleagues and I are also interest in at least 1 non-protein coding gene. Which >is why I'm playing with irap_raw2metric. I was hoping that I could normalize on protein coding genes but >still get TPM values for non-protein_coding genes. I guess the then it is not really called TPM anymore >because the sum of the values would be above 1 million?

Yes, for the non-coding genes is a modified version of TPM.

So if you have any thoughts on this they are very much appreciated as well. A possible solution would be include the biotype of the non-protein coding gene of interest perhaps, but this would mean that I start changing values if I add more biotypes of interest later on.

Yes, including the biotype of the non-protein coding gene is the most clean and easy to explain option. Assuming that your non-coding gene is a lincRNA, you may normalize the expression based solely on the expression of protein coding and lincRNA by running irap_raw2metric with --mass_biotype "protein_coding,lincRNA".

I still wonder how would a working command for TPM normalization look like on only protein_coding genes for irap_raw2metric? The following does not seem to work:

The following should work....

source /home/genomics/testsw/irap-1.0.0/irap_setup.sh irap_raw2metric --tsv /rst1/2017-0205_illuminaseq/data/seqData/analyzed/180222_NB501997_0051_AHTFJNBGX3/0051_P2017SEQE66S18_S1/0051_P2017SEQE66S18_S1/star/rsem/genes.raw.rsem.tsv --lengths /rst1/2017-0205_illuminaseq/data/references/Reference_Genomes/GRCh38.87/Annotations/Homo_sapiens.GRCh38.87.gtf.lengths.Rdata --feature gene --metric tpm --out /rst1/2017-0205_illuminaseq/scratch/swo-358/exclude_iRAP.tsv --mass_biotypes protein_coding

I hope you don't mind all these questions, I very much appreciate your work on iRAP and your time.

Thank you for asking ;-) Cheers.

freekvh commented 6 years ago

Hi Nuno,

I'm still getting unexpected results. (Note from below I removed all folder structures to make it more readable.)

First I use the standard normalization that irap_raw2metric offers: (one possible source of error is that I use the Rdata file generated bji iRAP-0.8.5.p2 by the way)

Script:

source /home/genomics/testsw/irap-1.0.0a/irap_setup.sh
irap_raw2metric -i genes.raw.rsem.tsv --lengths Homo_sapiens.GRCh38.87.gtf.lengths.Rdata --gtf Homo_sapiens.GRCh38.87.gtf --feature gene --metric tpm --out default_iRAP.tsv

output:

FO 28/03-14:54] Loading genes.raw.rsem.tsv done.
[INFO 28/03-14:54] Loading Homo_sapiens.GRCh38.87.gtf.lengths.Rdata
[INFO 28/03-14:54] Loading Homo_sapiens.GRCh38.87.gtf.lengths.Rdata done.
Read 2575494 rows and 9 (of 9) columns from 1.312 GB file in 00:00:20
GTF attributes  ensembl90 
[INFO 28/03-14:55] biotype col:gene_biotype
[INFO 28/03-14:55] GTF file loaded: Homo_sapiens.GRCh38.87.gtf 703935 entries
[INFO 28/03-14:55] Saved default_iRAP.tsv

So far so good, I get the same value as from a standard iRAP run, as expected. Now using protein_coding based normalization:

Script:

source /home/genomics/testsw/irap-1.0.0a/irap_setup.sh
irap_raw2metric -i genes.raw.rsem.tsv --lengths Homo_sapiens.GRCh38.87.gtf.lengths.Rdata --gtf Homo_sapiens.GRCh38.87.gtf --feature gene --metric tpm --out protein-coding_iRAP.tsv --mass_biotypes protein_coding

output:

[INFO 28/03-15:36] Loading genes.raw.rsem.tsv
[INFO 28/03-15:36] Loading genes.raw.rsem.tsv done.
[INFO 28/03-15:36] Loading Homo_sapiens.GRCh38.87.gtf.lengths.Rdata
[INFO 28/03-15:36] Loading Homo_sapiens.GRCh38.87.gtf.lengths.Rdata done.
Read 2575494 rows and 9 (of 9) columns from 1.312 GB file in 00:00:19
GTF attributes  ensembl90 
[INFO 28/03-15:37] biotype col:gene_biotype
[INFO 28/03-15:37] GTF file loaded: Homo_sapiens.GRCh38.87.gtf 703935 entries
[INFO 28/03-15:37] Found protein_coding
[INFO 28/03-15:37] #Features used to compute the mass:702663
[INFO 28/03-15:37] Saved protein-coding_iRAP.tsv

Something is wrong because the this gives exactly the same results as before (58051 genes with the same mean expression.)

The only way I have been successful in renormalizing was by feeding the script a list of only protein coding genes (~19000 genes).

Am I still doing something wrong here? Should I be using a new Rdata-file, generated with iRAP-1.0.0a? By the way, the command above reports that using --mass-biotypes requires a GTF but I don't see it using it.

Highest regards,

Freek.

nunofonseca commented 6 years ago

Hi Freek,

Thank you for the detailed description. Indeed the parameter mass_biotypes was being ignored when computing TPMs - for rpkm, fpkm, fpkm-uq and uq-fpkm was working as expected. I suspect that my subconscious had some objections to change the TPM formula ;-) The code irap_raw2metric's code (in devel) was changed for the parameter mass_biotypes to work as expected while computing TPMs. A new release should come out by the end of the week. Please let me know if it works for you...or not. Cheers.

nunofonseca commented 6 years ago

Closing this issue since the new version is finally out. Cheers.