mhammell-laboratory / TEtranscripts

A package for including transposable elements in differential enrichment analysis of sequencing datasets.
http://hammelllab.labsites.cshl.edu/software/#TEtranscripts
GNU General Public License v3.0
229 stars 30 forks source link

Error in building gene/TE index #57

Closed alp78 closed 5 years ago

alp78 commented 5 years ago

Hello,

I keep getting stuck at the TE index build, for both TEcount and TEtranscripts commands.

I've tried to use the curated GRCh38_rmsk_TE.gtf:

1 hg38_rmsk exon 10469 11447 3612 - . gene_id "TAR1"; transcript_id "TAR1"; family_id "telo"; class_id "Satellite";

Or a custom one:

1 hg38_rmsk exon 67108754 67109046 1892.000000 + . gene_id "L1P5"; transcript_id "L1P5"; family_id "L1"; class_id "LINE";

Genome ref (ensembl release 98):

1 havana transcript 11869 14409 . + . gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-202"; transcript_source "havana"; transcript_biotype "lncRNA"; tag "basic"; transcript_support_level "1";

header of BAM aligned with STAR and sorted by name:

@HD VN:1.4 SO:queryname @SQ SN:1 LN:248956422 @SQ SN:10 LN:133797422 @SQ SN:11 LN:135086622

record in BAM:

NB501229:191:HHVH5BGX7:1:11101:1052:10205 83 6 45442470 255 148M = 45442433 -185 GATTTAAATAATTTCTACAGGAATGTGAAAACTTGGTTAGCAACATCATATTCTATAGAAGTGGTTCAGAGTTAGGGAAGAGGACTGATGGGAGAAGGTATTATTCGTTGTGCAGAGGATGGAAAAAGTAAATAAGCTTACATAGGAT <EEEEEEEE/AEEEEEA/EEAEE</AA/EEAEAEEAE/AEEAE/AAEEEEEAEEAAAE/EAEEE//EEEEA6EEE/EAEAAEEEEAEE/EEEEAEEEA/EEEEAEAEAEE/EEEEEEEEEEEEEAAEEEEEEEEAEEEEE/EEAAAA6 NH:i:1 HI:i:1 AS:i:203 nM:i:2

Command:

TEcount \ --stranded reverse \ -b SAMPLE_ByName.sorted.bam \ --GTF REF/GRCh38.gtf \ --TE REF/GRCh38_rmsk_TE.gtf \ --mode multi \ --project TETR/TETR_

installed TEtranscripts via pip in a conda environment.

Any suggestion?

Btw, is there a limit to the number of BAM files passed to TEtranscripts in both -t and -c ?

alp78 commented 5 years ago

nevermind i was in a python 3 env.. :)

alp78 commented 5 years ago

Just a suggestion: it would be nice to be able to build both indexes only once, and then reuse the indexes for each subsequent runs, rather than to have to wait that the exact same indexes build each time...

olivertam commented 5 years ago

Hi,

Thank you for the suggestion. This has been something that we've been testing for indices that are larger and more complex (e.g. TE locus/instance level), but could be useful for any indices.

Regarding your question about number of BAM files, there is no strict upper limit for the -t or -c options for TEtranscripts. The only limitation is the amount of available memory for the program, and the amount of time required.

Thanks again for your interest in the software.

alp78 commented 5 years ago

Well, I've been running TEtranscripts with 8 samples per condition, it took 2 days, and there was no output file in the end. there might be an issue, the logs show correct count but there was no output.

`num of multi reads = 221027 total multi counts = 189930.0 total multi counts = 189930.0 total means = 2868.88423963 after normalization total means = 1.0 alpha = 3.07478691684. Performing a stabilization step. num of multi reads = 221027 total multi counts = 189930.0 total multi counts = 189930.0 total means = 2867.74363986 after normalization total means = 1.0 alpha = 3.07478691684, SQUAREM iteraton [5] 1/3 num of multi reads = 221027 total multi counts = 189930.0 total multi counts = 189930.0 total means = 2867.61433148 after normalization total means = 1.0 2/3 num of multi reads = 221027 total multi counts = 189930.0 total multi counts = 189930.0 total means = 2867.50064388 after normalization total means = 1.0 alpha = 4.0. Performing a stabilization step. num of multi reads = 221027 total multi counts = 189930.0 total multi counts = 189930.0 total means = 2866.92573351 after normalization total means = 1.0 alpha = 4.0, SQUAREM iteraton [6] 1/3 num of multi reads = 221027 total multi counts = 189930.0 total multi counts = 189930.0 total means = 2866.89576315 after normalization total means = 1.0 2/3 num of multi reads = 221027 total multi counts = 189930.0 total multi counts = 189930.0 total means = 2866.870081 after normalization total means = 1.0 rNome = OPT_TOL converge at iteration 6 num of multi reads = 221027 total multi counts = 189930.0 TE counts total 6120530.0 Gene counts total 28820559.0463

In library CONTROL/STAR/RNA3041/ST_RNA3041_ByName.sorted.bam: Total annotated reads = 34941089.0463 Total non-uniquely mapped reads = 4790595 Total unannotated reads = 19249948`

Is the count the same when running multiple samples in TEtranscripts as when running one of the sample with TEcount?

It might be good to provide also the option to parallelize when input is a list of huge BAMs. Anyway no output after 2 days, so I guess i will drop TEtranscripts for now, and aggregate each TEcount...

olivertam commented 5 years ago

Hi,

The quantification steps for TEtranscripts and TEcount are identical and produces the same output. The "benefit" of using TEtranscripts is that it combines the libraries into a single count table, generates a DESeq2 script (and assigning each library to treatment/control), and runs it. We had considered parallelization in TEtranscripts itself, but felt that TEcount was probably a better approach as it can be used in parallel for multiple library, but also eliminates the need to re-run all the libraries every time a new sample is added to the list (which is more likely for longitudinal/cohort studies). It also looks like your libraries might be deeper than ours (based on the total annotated read count), and thus might be taking longer if you have to run in series. I completely agree with you that using TEcount and then aggregating might be the faster option.

Our apologies for your difficulties in using our software. Please let us know if you have other issues.

AIBio commented 4 years ago

Hello,

I keep getting stuck at the TE index build, for both TEcount and TEtranscripts commands.

I've tried to use the curated GRCh38_rmsk_TE.gtf:

1 hg38_rmsk exon 10469 11447 3612 - . gene_id "TAR1"; transcript_id "TAR1"; family_id "telo"; class_id "Satellite";

Or a custom one:

1 hg38_rmsk exon 67108754 67109046 1892.000000 + . gene_id "L1P5"; transcript_id "L1P5"; family_id "L1"; class_id "LINE";

Genome ref (ensembl release 98):

1 havana transcript 11869 14409 . + . gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-202"; transcript_source "havana"; transcript_biotype "lncRNA"; tag "basic"; transcript_support_level "1";

header of BAM aligned with STAR and sorted by name:

@HD VN:1.4 SO:queryname @SQ SN:1 LN:248956422 @SQ SN:10 LN:133797422 @SQ SN:11 LN:135086622

record in BAM:

NB501229:191:HHVH5BGX7:1:11101:1052:10205 83 6 45442470 255 148M = 45442433 -185 GATTTAAATAATTTCTACAGGAATGTGAAAACTTGGTTAGCAACATCATATTCTATAGAAGTGGTTCAGAGTTAGGGAAGAGGACTGATGGGAGAAGGTATTATTCGTTGTGCAGAGGATGGAAAAAGTAAATAAGCTTACATAGGAT <EEEEEEEE/AEEEEEA/EEAEE</AA/EEAEAEEAE/AEEAE/AAEEEEEAEEAAAE/EAEEE//EEEEA6EEE/EAEAAEEEEAEE/EEEEAEEEA/EEEEAEAEAEE/EEEEEEEEEEEEEAAEEEEEEEEAEEEEE/EEAAAA6 NH:i:1 HI:i:1 AS:i:203 nM:i:2

Command:

TEcount \ --stranded reverse \ -b SAMPLE_ByName.sorted.bam \ --GTF REF/GRCh38.gtf \ --TE REF/GRCh38_rmsk_TE.gtf \ --mode multi \ --project TETR/TETR_

Keeping the chromosome names of TE gtf file consistent with the ones of GENE gtf file may solve this problem. More detail about chromosome names, please click here.