Compare command traceback

shenkers / isoscm

Transcript assembly tool using multiple change-point inference to improve 3'UTR annotation

13 stars 7 forks source link

Compare command traceback #7

Closed msxakk closed 9 years ago

msxakk commented 9 years ago

Dear IsoSCM developers,

I've successfully run assembly command and what to compare differential 3'UTR usage in my dataset. When I run compare command I get the following traceback:

2015-05-27T12:07:34,440 [main] INFO executable.IsoSCM - Starting compare ... Exception in thread "main" java.io.FileNotFoundException: /home/msxakk/Desktop/3primeUTR_lenght_work/A_assembly/tmp/NC.exon.trimmed.gtf (No such file or directory) at java.io.FileInputStream.open(Native Method) at java.io.FileInputStream.(FileInputStream.java:146) at util.IO.bufferedScanner(IO.java:198) at tools.ParseGTF$TranscriptIterator.(ParseGTF.java:266) at multisample.JointSegmentation.performJointSegmentation(JointSegmentation.java:161) at executable.IsoSCM.main(IsoSCM.java:613)

I used the following command: java -Xmx2048m -jar IsoSCM-2.0.7.jar compare -base test -x1 /home/msxakk/Desktop/3primeUTR_lenght_work/A_assembly/A.assembly_parameters.xml -x2 /home/msxakk/Desktop/3primeUTR_lenght_work/NC_assembly/NC.assembly_parameters.xml

It looks to me as if the program is trying to find the file NC.exon.trimmed.gtf but in wrong directory .../A_assembly/tmp/ . However it is in .../NC_assembly/tmp/

This appears to be a bug. Can you help?

Best wishes msxakk

shenkers commented 9 years ago

Hi msxakk,

You are correct, an updated version is available here: https://github.com/shenkers/isoscm/releases/tag/2.0.8

Let me know if this does not fix the issue.

Best, Sol

msxakk commented 9 years ago

It's running now with no trace back. We shall see... I presume it is ok to use the assembly made using v7 and compare using v8?

shenkers commented 9 years ago

yes, the only change I made was where it is looking for the files, so it should be fine.

msxakk commented 9 years ago

Is there a quick way to obtain gene name from the loci, i.e. how can I know the genes that have differential 3'UTR usage?

shenkers commented 9 years ago

there is nothing off the shelf that I'm aware of, although I agree that would be a nice extension. I can't promise a specific timeline, but I'll try and get an update for this in the next couple days...

msxakk commented 9 years ago

Hi Shenkers,

Thanks for your quick replies. May I ask you a couple more questions regarding the interpretation of compare command main .txt output and few general questions:

According to documentation the 'confidence' value is "the likelihood that a change-point occurs at the given position, compared to a null model that no change-point at this position". I'm a bit confused here. Is this the probability of observed data (that there is point-change) if null hypothesis were true (i.e. statistical P-value) or is it the probability that it is true that there is a point-change here? In other words is it better if this value is low or high.
It seems to me that likelihood value refers to change-point (whatever it is). I was wondering if there is a statistical value associated with any potentially observed differential (between conditions/states) 3UTR site usage?
I am trying to understand how the 3UTR usage is calculated. So you divide the depth downstream (step-down) by the depth upstream (step-up). So in my mind it is like the reads downstream are treated as % of upstream. Then you use this value to subtract from 1. That means that the more reads downstream relative to upstream (bigger %) then the lower final usage value. But this seems counter intuitive. You would think that the more reads generating from 3UTR region indicate bigger usage? I would appreciate if you could explain the concept of usage a little bit here. The differential usage would then be obvious.
I like the figure 6B in your paper. Can you tell me how you generated this matrix? I presume you used some bioconductor package? That will be useful when I report my data.

P.S Let me know if you prefer to open a new thread for this kind of discussion.

Thanks! msxakk

shenkers commented 9 years ago

I'm happy leaving the discussion here, I may re-organize things to create a wiki and create an FAQ section at some point in the future.

The confidence can be thought of as the complement of the p-value, i.e. confidence = 1-(p-value), so higher is better.
The current version of IsoSCM does not do this estimation, although several tools such as MISO or DEXSeq can be used to make these sorts of assessments. To use MISO or DEXSeq you would need to convert the output of IsoSCM into something that is compatible with these tools. IsoSCM, DEXSeq, and MISO focus on slightly different problems, and I would rather delegate the quantification to tools that specialize in isoform quantification rather than integrate everything in IsoSCM.
The usage is an estimate of the relative fraction of isoforms from the locus that use the polyadenylation site at the position of the change-point (as opposed to a polyadenylation site downstream). The usage will vary between 0 (when there is equal coverage up and down stream) and 1 (when there is zero coverage downstream). I think you're confused because you're interpreting the usage as "usage of the extended isoform", when it represents "1-(usage of the extended isoform)". I guess you looked at the explanation in the readme, so let me know if there were places where that explanation could be more clear.
The heatmap in 6B could be generated in R (although I used a different program). The colors are proportional to the log(p-value) of a test for differential usage at tandem APA sites. If there was not a trend towards lengthening or shortening, you would expect the differential usage to be symetrically distributed about 0. If there are more lengthening events (differential usage > 0) in one of the two tissues, you could test if this difference is significant compared to a null model that there would be an equal proportion of lengthening events in each tissue (by a binomial test, for example). The upper and lower triangles of the matrix represent the two one sided tests respectively.

msxakk commented 9 years ago

Hi Shenkers,

Thanks for that. I'll get back to this discussion in due course because it is important. I've got some other data to analyze for now and so can't properly reply.

Hear you soon msxakk

shenkers commented 9 years ago

@msxakk, I added the feature you requested (https://github.com/shenkers/isoscm/releases/tag/2.0.9), to identify loci that are identified by the 'compare' command that share structure with reference gene models.

To run this command type:

java -jar IsoSCM-2.0.9.jar diff -x [ compare_paramters.xml from compare step ] -G [ reference gene models GTF ]

The reference gene models have to use the same format as the Ensembl GTFs ( http://useast.ensembl.org/info/data/ftp/index.html ), ie they need "gene_id" and "transcript_id" attributes delimited using the same spacing characters. If you use gene models from another source you need to make sure that they match the Ensembl format or they will not be parsed correctly.

The output of this command will be a table with columns:

exon - the terminal exon from the compare command
locus_id - the locus_id, corresponding to the output of the compare command
match_5p_3p - the transcript_id's of reference models that match both the 5' and 3' boundaries of the assembled exon
match_5p_3p_type - the relative location of the matching exon in the reference transcript
match_5p - the transcript_id's of reference models that match only the 5' boundaries of the assembled exon
match_5p_type - the relative location of the matching exon in the reference transcript
match_3p - the transcript_id's of reference models that match only the 3' boundaries of the assembled exon
match_3p_type - the relative location of the matching exon in the reference transcript
overlaps - the transcript_id's of reference models that overlap the assembled exon
overlaps_type - the relative location of the matching exon in the reference transcript

The reference exon types will be one of [ 5p_exon, internal_exon, 3p_exon ] if it comes from a spliced transcript, and unspliced if the reference model is a single-exon transcript. While it is possible that IsoSCM will assemble novel 3'-terminal exons that were never previously annotated, the most conservative approach would be to focus on terminal exons that are consistent with existing gene models. You can do this by filtering for rows that have a match_5p_type of _3pexon, to select loci that exactly match the 5' boundary of an annotated terminal exon.

Let me know if your run into any problems.

Best, Sol

adomingues commented 9 years ago

Hi Sol,

just a follow up on some of @msxakk questions, in particular regarding the p-value. You mentioned that:

The confidence can be thought of as the complement of the p-value, i.e. confidence = 1-(p-value), so higher is better.

But if my reading is correct, confidence should never be above 1, because 0 > p-value < 1. However, I have some (1947 to be precise) confidence values > 1. Am I missing something?

To give yo some background, I am trying to filter the results of compare (currently > 30k different change points) to have only a set of UTRs that show differences between conditions that could be used for further analysis validation. Tentatively, downstream_cov smaller than the median will be removed, as will UTRs with small differential_usage. These can defined by looking at their distribution. However, I am struggling with the use of the confidence values.

Any feedback on how to filter the results is welcome.

(the fixes in v2.0.11 worked btw)

shenkers commented 9 years ago

Hi Antonio,

The issue with the confidence values is something I have noticed but not resolved yet. The confidence values are calculated by summing over the likelihoods of various 3'utr segmentations, comparing those that have a change point within a specified number of bins (parameter -c of the assemble command), with segmentations that have no change point within -c bins of that position. Values greater than 1.0 happen when the read coverage supports a model where two change points occur in close proximity (less than 2*c+1 bins apart). I agree that this isn't ideal, so I'll try to see if there is a better approach.

About filtering, I didn't understand what filter you are using for the downstream coverage. I would recommend a minimum upstream coverage in both conditions, as well as a minimum differential usage, as you were planning to use. Sometimes the differential segments will be very short, which I would be more cautious to believe without many replicated out otherc experimental evidence. If you want to be conservative about this case you could require that the downstream segment is above a threshold length.

Glad to hear 2.0.11 worked, thanks for letting me know.

adomingues commented 9 years ago

Values greater than 1.0 happen when the read coverage supports a model where two change points occur in close proximity (less than 2*c+1 bins apart). I agree that this isn't ideal, so I'll try to see if there is a better approach.

Got it. I was also surprised at the large number of confidence values == 1, about half (15954/30322), which was surprising as I naively expected the distribution to tail at 1:

confidence_distribution

So I am still unsure on which values to use. From what you told me values greater than 1.0 are harder to interpret, and the number of high confidence values, (=1.0), is a bit too high. Still, I guess that > 0.99 should be a good starting point.

About filtering, I didn't understand what filter you are using for the downstream coverage.

I calculate the median downstream_cov for both conditions, take the smallest value, and use this as the threshold for minimum coverage in both conditions. Basically, trying to take only "highish" coverage regions if that makes sense.

I would recommend a minimum upstream coverage in both conditions, as well as a minimum differential usage, as you were planning to use.

It's more or less what I am doing (above), except for downstream coverage, and minimum upstream is decided based on median coverage. The differential usage is arbitrarily set at a minimum of 0.20.

If you want to be conservative about this case you could require that the downstream segment is above a threshold length.

I have considered this. Good idea.

Thanks for your help.

shenkers commented 9 years ago

On Jul 22, 2015 4:29 AM, "FridayMeetsSunday" notifications@github.com wrote:

Got it. I was also surprised at the large number of confidence values == 1, about half (15954/30322), which was surprising as I naively expected the distribution to tail at 1:

The segmentation algorithm used by IsoSCM reports a maximum likelihood solution for the position of changepoints, after weighing all possible configurations (if you are familiar with HMMs it is analogous to a viterbi decoding). The confidence values are calculated post-hoc to try and capture how well the changepoint can be localized (imagine a sharp drop compared to a gradual decline in coverage). Since the confidence values are calculated after the fact from the maximum likelihood segmentation you should not expect a majority of low confidence change points with a tail of high confidence values.

So I am still unsure on which values to use. From what you told me values greater than 1.0 are harder to interpret, and the number of high confidence values, (=1.0), is a bit too high. Still, I guess that > 0.99 should be a good starting point.

Yes, although I would expect the values greater than one to represent real change points, so I would not discard them. If you set the assembly step '-c' parameter to zero you will get more conservative confidence values, and this should also eliminate confidence values greater than one.

About filtering, I didn't understand what filter you are using for the downstream coverage.

I calculate the median downstream_cov for both conditions, take the smallest value, and use this as the threshold for minimum coverage in both conditions. Basically, trying to take only "highish" coverage regions if that makes sense.

My only hesitation about doing it this way is that it requires the long isoform to be expressed in both conditions. At least when comparing tissues I have seen many examples where the long isoform is present in one tissue and completely absent from the other. If you require a minimum coverage in the downstream segment in both conditions you will filter out these cases. The differential usage is arbitrarily set at a minimum of 0.20.

Do you have replicates? If so, you could run the analysis comparing two replicates to empirically choose a cutoff. A priori you don't expect to see any differential usage between replicates, so you could choose a cutoff 'X' where 95% of change points between replicates have a differential usage less than X.

— Reply to this email directly or view it on GitHub.

adomingues commented 9 years ago

First of all thank you for the detailed answer. I spent some time looking at predicted events yesterday, following your suggestions, and I can report that by and large IsoSCM does indeed a very good job at predicting change points, or at least UTR's that undergoes differential expression. It tends to over-estimate the lengthening of 3'UTRs, that is predict UTRs tend to be longer than what is suggested by coverage, but you allude to that in the paper anyway (that preditions are not bp-resolution). It is still very valuable information. I will not reply point by point, but I will leave here my final filtering selection. The only thing I did not do was to use replicate comparison to determine the cut-off. I do have replicates, but long story short it would not be advisable to use them in this situation.

Filtering:

upstream_cov_WT > cutoff | upstream_cov_KD > cutoff, and cutoff being the lowest median value of upstream_cov of both samples. This might be a bit too strict but I wanted to be reasonably conservative. Higher than the 1st quartile should to the trick.
abs(differential_usage) > 0.20, maybe strict but helps to choose events for which one could see differences in PCR validation.
downstream segment length > 50. This is arbitrary. I plotted the distribution and there weren't that many very short downstream segment. 50 nt was what I considered reasonable for the species I am working with.
confidence > 0.99. I ended up not excluding those > 1 values, and the results look good, that is looking on IGV the coverage supports the predictions. Again, this is little too strict, but I wanted to have a set of regions that could immediately see differences between conditions (or not).

Applying these filters, 30322 events were reduced to 122. I hope this helps someone else using IsoSCM.

qoiopipq commented 8 years ago

I ran assemble command with c=2 (as default) then compare, I've got 8921 splicing events in total, 3827 of them have confidence values >=1. I saw the previous post about adjusting '-c' parameter, so I tried to set '-c' parameter to zero in assemble step, but I got the following trackback: 2016-01-27T14:20:48,581 [main] INFO executable.IsoSCM - Starting assemble ... 2016-01-27T14:20:49,465 [main] INFO executable.IsoSCM - tabulating splice junctions... 2016-01-27T14:21:51,295 [main] INFO executable.IsoSCM - counting junction supporting reads... 2016-01-27T14:31:14,485 [main] INFO executable.IsoSCM - identifying covered segments... 2016-01-27T14:48:21,038 [main] INFO executable.IsoSCM - merging coverage gaps less than 100 bp... 2016-01-27T14:48:28,718 [main] INFO executable.IsoSCM - filtering splice junctions... 2016-01-27T14:48:33,916 [main] INFO executable.IsoSCM - identifying spliced exons... 2016-01-27T14:48:37,654 [main] INFO executable.IsoSCM - identifying change points... Exception in thread "main" java.lang.IllegalArgumentException: fromIndex(15) > toIndex(14) at java.util.ArrayList.subListRangeCheck(ArrayList.java:1006) at java.util.ArrayList.subList(ArrayList.java:996) at scm.ConfidenceCalculator.calculateConfidence(ConfidenceCalculator.java:31) at scm.IdentifyChangePoints.identifyConstrainedNegativeBinomialPoints(IdentifyChangePoints.java:245) at scm.IdentifyChangePoints.identifyNegativeBinomialChangePointsInLongSegments(IdentifyChangePoints.java:660) at executable.IsoSCM.main(IsoSCM.java:358)

Any feedback on how to solve this problem is welcome.

shenkers commented 8 years ago

Hi @qoiopipq, thank you for reporting this, it appears to be a bug. I believe I have identified the problem, would it be possible for you to share the .bam file that causes this error so that I can verify the fix?

Regarding confidence values >=1, I think I will remove the -c parameter (so that 'compare' can only be run with c=0 and confidence values will be reported in the proper range).

p.s. you say "..I've got 8921 splicing events..." IsoSCM attempts to identify polyadenylation events, not splicing events.

p.p.s. your report belongs in a new issue since it is not related to the error reported on this thread. You probably wanted to reference confidence values >=1, so you posted here. Github has a built in mechanism to do this, if you want to reference another thread just use a hashtag for the issue number, for example if you want to reference this thread just type #7 and github will create a link to this page when you post your comment.