shenkers / isoscm

Transcript assembly tool using multiple change-point inference to improve 3'UTR annotation
13 stars 7 forks source link

Compare command traceback #7

Closed msxakk closed 9 years ago

msxakk commented 9 years ago

Dear IsoSCM developers,

I've successfully run assembly command and what to compare differential 3'UTR usage in my dataset. When I run compare command I get the following traceback:

2015-05-27T12:07:34,440 [main] INFO executable.IsoSCM - Starting compare ... Exception in thread "main" java.io.FileNotFoundException: /home/msxakk/Desktop/3primeUTR_lenght_work/A_assembly/tmp/NC.exon.trimmed.gtf (No such file or directory) at java.io.FileInputStream.open(Native Method) at java.io.FileInputStream.(FileInputStream.java:146) at util.IO.bufferedScanner(IO.java:198) at tools.ParseGTF$TranscriptIterator.(ParseGTF.java:266) at multisample.JointSegmentation.performJointSegmentation(JointSegmentation.java:161) at executable.IsoSCM.main(IsoSCM.java:613)

I used the following command: java -Xmx2048m -jar IsoSCM-2.0.7.jar compare -base test -x1 /home/msxakk/Desktop/3primeUTR_lenght_work/A_assembly/A.assembly_parameters.xml -x2 /home/msxakk/Desktop/3primeUTR_lenght_work/NC_assembly/NC.assembly_parameters.xml

It looks to me as if the program is trying to find the file NC.exon.trimmed.gtf but in wrong directory .../A_assembly/tmp/ . However it is in .../NC_assembly/tmp/

This appears to be a bug. Can you help?

Best wishes msxakk

shenkers commented 9 years ago

Hi msxakk,

You are correct, an updated version is available here: https://github.com/shenkers/isoscm/releases/tag/2.0.8

Let me know if this does not fix the issue.

Best, Sol

msxakk commented 9 years ago

It's running now with no trace back. We shall see... I presume it is ok to use the assembly made using v7 and compare using v8?

shenkers commented 9 years ago

yes, the only change I made was where it is looking for the files, so it should be fine.

msxakk commented 9 years ago

Is there a quick way to obtain gene name from the loci, i.e. how can I know the genes that have differential 3'UTR usage?

shenkers commented 9 years ago

there is nothing off the shelf that I'm aware of, although I agree that would be a nice extension. I can't promise a specific timeline, but I'll try and get an update for this in the next couple days...

msxakk commented 9 years ago

Hi Shenkers,

Thanks for your quick replies. May I ask you a couple more questions regarding the interpretation of compare command main .txt output and few general questions:

P.S Let me know if you prefer to open a new thread for this kind of discussion.

Thanks! msxakk

shenkers commented 9 years ago

I'm happy leaving the discussion here, I may re-organize things to create a wiki and create an FAQ section at some point in the future.

msxakk commented 9 years ago

Hi Shenkers,

Thanks for that. I'll get back to this discussion in due course because it is important. I've got some other data to analyze for now and so can't properly reply.

Hear you soon msxakk

shenkers commented 9 years ago

@msxakk, I added the feature you requested (https://github.com/shenkers/isoscm/releases/tag/2.0.9), to identify loci that are identified by the 'compare' command that share structure with reference gene models.

To run this command type:

java -jar IsoSCM-2.0.9.jar diff -x [ compare_paramters.xml from compare step ] -G [ reference gene models GTF ]

The reference gene models have to use the same format as the Ensembl GTFs ( http://useast.ensembl.org/info/data/ftp/index.html ), ie they need "gene_id" and "transcript_id" attributes delimited using the same spacing characters. If you use gene models from another source you need to make sure that they match the Ensembl format or they will not be parsed correctly.

The output of this command will be a table with columns:

  1. exon - the terminal exon from the compare command
  2. locus_id - the locus_id, corresponding to the output of the compare command
  3. match_5p_3p - the transcript_id's of reference models that match both the 5' and 3' boundaries of the assembled exon
  4. match_5p_3p_type - the relative location of the matching exon in the reference transcript
  5. match_5p - the transcript_id's of reference models that match only the 5' boundaries of the assembled exon
  6. match_5p_type - the relative location of the matching exon in the reference transcript
  7. match_3p - the transcript_id's of reference models that match only the 3' boundaries of the assembled exon
  8. match_3p_type - the relative location of the matching exon in the reference transcript
  9. overlaps - the transcript_id's of reference models that overlap the assembled exon
  10. overlaps_type - the relative location of the matching exon in the reference transcript

The reference exon types will be one of [ 5p_exon, internal_exon, 3p_exon ] if it comes from a spliced transcript, and unspliced if the reference model is a single-exon transcript. While it is possible that IsoSCM will assemble novel 3'-terminal exons that were never previously annotated, the most conservative approach would be to focus on terminal exons that are consistent with existing gene models. You can do this by filtering for rows that have a match_5p_type of _3pexon, to select loci that exactly match the 5' boundary of an annotated terminal exon.

Let me know if your run into any problems.

Best, Sol

adomingues commented 9 years ago

Hi Sol,

just a follow up on some of @msxakk questions, in particular regarding the p-value. You mentioned that:

The confidence can be thought of as the complement of the p-value, i.e. confidence = 1-(p-value), so higher is better.

But if my reading is correct, confidence should never be above 1, because 0 > p-value < 1. However, I have some (1947 to be precise) confidence values > 1. Am I missing something?

To give yo some background, I am trying to filter the results of compare (currently > 30k different change points) to have only a set of UTRs that show differences between conditions that could be used for further analysis validation. Tentatively, downstream_cov smaller than the median will be removed, as will UTRs with small differential_usage. These can defined by looking at their distribution. However, I am struggling with the use of the confidence values.

Any feedback on how to filter the results is welcome.

(the fixes in v2.0.11 worked btw)

shenkers commented 9 years ago

Hi Antonio,

The issue with the confidence values is something I have noticed but not resolved yet. The confidence values are calculated by summing over the likelihoods of various 3'utr segmentations, comparing those that have a change point within a specified number of bins (parameter -c of the assemble command), with segmentations that have no change point within -c bins of that position. Values greater than 1.0 happen when the read coverage supports a model where two change points occur in close proximity (less than 2*c+1 bins apart). I agree that this isn't ideal, so I'll try to see if there is a better approach.

About filtering, I didn't understand what filter you are using for the downstream coverage. I would recommend a minimum upstream coverage in both conditions, as well as a minimum differential usage, as you were planning to use. Sometimes the differential segments will be very short, which I would be more cautious to believe without many replicated out otherc experimental evidence. If you want to be conservative about this case you could require that the downstream segment is above a threshold length.

Glad to hear 2.0.11 worked, thanks for letting me know.

adomingues commented 9 years ago

Values greater than 1.0 happen when the read coverage supports a model where two change points occur in close proximity (less than 2*c+1 bins apart). I agree that this isn't ideal, so I'll try to see if there is a better approach.

Got it. I was also surprised at the large number of confidence values == 1, about half (15954/30322), which was surprising as I naively expected the distribution to tail at 1:

confidence_distribution

So I am still unsure on which values to use. From what you told me values greater than 1.0 are harder to interpret, and the number of high confidence values, (=1.0), is a bit too high. Still, I guess that > 0.99 should be a good starting point.

About filtering, I didn't understand what filter you are using for the downstream coverage.

I calculate the median downstream_cov for both conditions, take the smallest value, and use this as the threshold for minimum coverage in both conditions. Basically, trying to take only "highish" coverage regions if that makes sense.

I would recommend a minimum upstream coverage in both conditions, as well as a minimum differential usage, as you were planning to use.

It's more or less what I am doing (above), except for downstream coverage, and minimum upstream is decided based on median coverage. The differential usage is arbitrarily set at a minimum of 0.20.

If you want to be conservative about this case you could require that the downstream segment is above a threshold length.

I have considered this. Good idea.

Thanks for your help.

shenkers commented 9 years ago

On Jul 22, 2015 4:29 AM, "FridayMeetsSunday" notifications@github.com wrote:

Got it. I was also surprised at the large number of confidence values == 1, about half (15954/30322), which was surprising as I naively expected the distribution to tail at 1:

The segmentation algorithm used by IsoSCM reports a maximum likelihood solution for the position of changepoints, after weighing all possible configurations (if you are familiar with HMMs it is analogous to a viterbi decoding). The confidence values are calculated post-hoc to try and capture how well the changepoint can be localized (imagine a sharp drop compared to a gradual decline in coverage). Since the confidence values are calculated after the fact from the maximum likelihood segmentation you should not expect a majority of low confidence change points with a tail of high confidence values.

So I am still unsure on which values to use. From what you told me values greater than 1.0 are harder to interpret, and the number of high confidence values, (=1.0), is a bit too high. Still, I guess that > 0.99 should be a good starting point.

Yes, although I would expect the values greater than one to represent real change points, so I would not discard them. If you set the assembly step '-c' parameter to zero you will get more conservative confidence values, and this should also eliminate confidence values greater than one.

About filtering, I didn't understand what filter you are using for the downstream coverage.

I calculate the median downstream_cov for both conditions, take the smallest value, and use this as the threshold for minimum coverage in both conditions. Basically, trying to take only "highish" coverage regions if that makes sense.

My only hesitation about doing it this way is that it requires the long isoform to be expressed in both conditions. At least when comparing tissues I have seen many examples where the long isoform is present in one tissue and completely absent from the other. If you require a minimum coverage in the downstream segment in both conditions you will filter out these cases. The differential usage is arbitrarily set at a minimum of 0.20.

Do you have replicates? If so, you could run the analysis comparing two replicates to empirically choose a cutoff. A priori you don't expect to see any differential usage between replicates, so you could choose a cutoff 'X' where 95% of change points between replicates have a differential usage less than X.

— Reply to this email directly or view it on GitHub.

adomingues commented 9 years ago

First of all thank you for the detailed answer. I spent some time looking at predicted events yesterday, following your suggestions, and I can report that by and large IsoSCM does indeed a very good job at predicting change points, or at least UTR's that undergoes differential expression. It tends to over-estimate the lengthening of 3'UTRs, that is predict UTRs tend to be longer than what is suggested by coverage, but you allude to that in the paper anyway (that preditions are not bp-resolution). It is still very valuable information. I will not reply point by point, but I will leave here my final filtering selection. The only thing I did not do was to use replicate comparison to determine the cut-off. I do have replicates, but long story short it would not be advisable to use them in this situation.

Filtering:

Applying these filters, 30322 events were reduced to 122. I hope this helps someone else using IsoSCM.

qoiopipq commented 8 years ago

I ran assemble command with c=2 (as default) then compare, I've got 8921 splicing events in total, 3827 of them have confidence values >=1. I saw the previous post about adjusting '-c' parameter, so I tried to set '-c' parameter to zero in assemble step, but I got the following trackback: 2016-01-27T14:20:48,581 [main] INFO executable.IsoSCM - Starting assemble ... 2016-01-27T14:20:49,465 [main] INFO executable.IsoSCM - tabulating splice junctions... 2016-01-27T14:21:51,295 [main] INFO executable.IsoSCM - counting junction supporting reads... 2016-01-27T14:31:14,485 [main] INFO executable.IsoSCM - identifying covered segments... 2016-01-27T14:48:21,038 [main] INFO executable.IsoSCM - merging coverage gaps less than 100 bp... 2016-01-27T14:48:28,718 [main] INFO executable.IsoSCM - filtering splice junctions... 2016-01-27T14:48:33,916 [main] INFO executable.IsoSCM - identifying spliced exons... 2016-01-27T14:48:37,654 [main] INFO executable.IsoSCM - identifying change points... Exception in thread "main" java.lang.IllegalArgumentException: fromIndex(15) > toIndex(14) at java.util.ArrayList.subListRangeCheck(ArrayList.java:1006) at java.util.ArrayList.subList(ArrayList.java:996) at scm.ConfidenceCalculator.calculateConfidence(ConfidenceCalculator.java:31) at scm.IdentifyChangePoints.identifyConstrainedNegativeBinomialPoints(IdentifyChangePoints.java:245) at scm.IdentifyChangePoints.identifyNegativeBinomialChangePointsInLongSegments(IdentifyChangePoints.java:660) at executable.IsoSCM.main(IsoSCM.java:358)

Any feedback on how to solve this problem is welcome.

shenkers commented 8 years ago

Hi @qoiopipq, thank you for reporting this, it appears to be a bug. I believe I have identified the problem, would it be possible for you to share the .bam file that causes this error so that I can verify the fix?

Regarding confidence values >=1, I think I will remove the -c parameter (so that 'compare' can only be run with c=0 and confidence values will be reported in the proper range).

p.s. you say "..I've got 8921 splicing events..." IsoSCM attempts to identify polyadenylation events, not splicing events.

p.p.s. your report belongs in a new issue since it is not related to the error reported on this thread. You probably wanted to reference confidence values >=1, so you posted here. Github has a built in mechanism to do this, if you want to reference another thread just use a hashtag for the issue number, for example if you want to reference this thread just type #7 and github will create a link to this page when you post your comment.