Large (>1TB) isoforms.gtf - enumerate

shenkers / isoscm

Transcript assembly tool using multiple change-point inference to improve 3'UTR annotation


Large (>1TB) isoforms.gtf - enumerate #10

Status: Closed (adomingues closed this issue 9 years ago)

adomingues commented 9 years ago

I was running enumerate on relatively large datasets (~30GB bam) with -max_isoforms 20, but had to kill the job because it was generating a >1TB isoforms.gtf. Is this to be expected, and should -max_isoforms be set to a lower number?

Details: zebrafish, 100PE reads, mapped with STAR.

The compare command ran fine.

Cheers.

shenkers commented 9 years ago

Hi @adomingues, thanks for the report, >1TB does sound unusual.

I have the impression that the default parameters with STAR are more generous with mapping spliced reads than other aligners. If there were a lot of spuriously mapped spliced reads, that could result in a large number of isoforms reported by enumerate.

I'm curious if IsoSCM is not correctly skipping loci with more than 20 isoforms, can you confirm that the output file (it should be [base].skipped_loci.txt) is not empty? Also, if you haven't deleted the [base].isoforms.gtf file, is it apparent that IsoSCM is stuck enumerating isoforms for one locus? i.e. if you look at the last lines appended to the file, do they all come from one locus?
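The tail check described above can be sketched in Python. This is only a minimal sketch and not part of IsoSCM: the helper names `tail_lines` and `count_loci` are hypothetical, and it assumes each record in the isoforms.gtf carries a GTF-style `gene_id "..."` attribute identifying its locus, which may not match what IsoSCM actually writes. It reads only the last chunk of the file, so it should stay cheap even on a very large output.

```python
import os
import re
from collections import Counter


def tail_lines(path, max_bytes=1_000_000):
    """Return the lines found in the last max_bytes of a file."""
    size = os.path.getsize(path)
    start = max(0, size - max_bytes)
    with open(path, "rb") as handle:
        handle.seek(start)
        chunk = handle.read().decode("utf-8", errors="replace")
    lines = chunk.splitlines()
    # If we started mid-file, the first line may be truncated; drop it.
    if start > 0 and lines:
        lines = lines[1:]
    return lines


def count_loci(lines, attribute="gene_id"):
    """Count records per (chromosome, locus) pair.

    Assumption: the locus is named by a GTF attribute such as gene_id "...".
    Records without that attribute are counted under (chromosome, None).
    """
    pattern = re.compile(re.escape(attribute) + r' "([^"]*)"')
    counts = Counter()
    for line in lines:
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        match = pattern.search(fields[8])
        counts[(fields[0], match.group(1) if match else None)] += 1
    return counts
```

For example, `count_loci(tail_lines("[base].isoforms.gtf")).most_common(3)` would list the most frequent loci near the end of the file. If a single (chromosome, locus) pair accounts for nearly all of the counted lines, that would suggest enumeration is stuck on one locus.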

If you're able to share the .gtf from the assembly step I would be happy to look at this in more detail.

-Sol

adomingues commented 9 years ago

Hi @shenkers,

you might be on to something.

> can you confirm that the output file (it should be [base].skipped_loci.txt) is not empty?

The [base].skipped_loci.txt files are indeed empty.

> Also, if you haven't deleted the [base].isoforms.gtf file, is it apparent that IsoSCM is stuck enumerating isoforms for one locus?

These files have been deleted, and I can't remember if they contained only one locus.

> If you're able to share the .gtf from the assembly step I would be happy to look at this in more detail.

No problem, but can I share it via email instead of posting publicly?

shenkers commented 9 years ago

Of course! My email is sol.shenker@gmail.com

shenkers commented 9 years ago

I just made a new release (https://github.com/shenkers/isoscm/releases/tag/2.0.10) which should resolve this issue.

This issue occurs because the aligner (STAR) is assigning some reads a strand tag (XS) that conflicts with the strand implied by the sequencing protocol.

Since these reads provide conflicting information and it's unclear what strand they really come from, I've modified IsoSCM such that splice junctions from reads with conflicting strand information are omitted from assembled transcript models. Let me know if you are still encountering any issues.
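To illustrate what "conflicting strand information" can mean, here is a minimal Python sketch that compares a read's XS tag with the strand implied by its SAM flag and the library type. It is not IsoSCM's actual implementation. The function names are hypothetical, the library-type labels `fr-firststrand` and `fr-secondstrand` are borrowed from the TopHat/Cufflinks convention, and the thread does not say which protocol was used, so the default here is an assumption.

```python
# SAM flag bits (from the SAM specification).
FLAG_REVERSE = 0x10  # read is aligned to the reverse strand
FLAG_READ2 = 0x80    # read is the second mate of a pair


def protocol_strand(flag, library_type="fr-firststrand"):
    """Transcript strand implied by the SAM flag and the library type.

    Assumption: reads with no mate bits set are treated like read 1.
    """
    if library_type not in ("fr-firststrand", "fr-secondstrand"):
        raise ValueError("unsupported library type: " + library_type)
    aligned_plus = not (flag & FLAG_REVERSE)
    is_read2 = bool(flag & FLAG_READ2)
    # fr-secondstrand: read 1 has the transcript's orientation,
    # read 2 the opposite one.
    transcript_plus = aligned_plus != is_read2
    # fr-firststrand is the mirror image of fr-secondstrand.
    if library_type == "fr-firststrand":
        transcript_plus = not transcript_plus
    return "+" if transcript_plus else "-"


def has_strand_conflict(flag, xs_tag, library_type="fr-firststrand"):
    """True if the XS tag disagrees with the protocol-implied strand.

    xs_tag is "+", "-", or None when the read carries no XS tag.
    """
    if xs_tag is None:
        return False  # nothing to compare against
    return xs_tag != protocol_strand(flag, library_type)
```

Under these assumptions, a forward-mapped read 1 (flag 99) from an fr-firststrand library implies a minus-strand transcript, so `has_strand_conflict(99, "+")` flags it as a conflict. A filter in this spirit would skip the splice junctions of such reads rather than guess their strand.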

-Sol