Closed by lsoldini 2 years ago
Hi again rpvg-team,
As a follow-up, I also have some related questions that are not linked to the crash:
I am using --inference-model haplotype-transcripts, because I found appealing the idea of first assigning a probability to each diplotype and then inferring expression. I have diploid samples of known genotype, and what I really care about in the end is the expression values. So, although I am not using the diplotype probabilities, would it still be better to use the haplotype-transcripts model?
About the -e parameter: I have a strand-specific library (Illumina TruSeq), but I did not use that parameter in my first tests. I guess it would be better to use it, but then I am concerned about whether any additional parameter should have been used in vg mpmap?
I was wondering about this (in the "haplotype specific ..." paper): "While HST expression estimates can always be marginalized to produce allele or transcript expression estimates, more general statistical frameworks will need to be developed to avoid information loss between these steps in transcriptomic pipelines". Given the complex distributions of sequencing data, I understand that this would require models that specifically fit the mathematics rpvg uses to infer expression and that take advantage of all the associated probabilities it provides. But since there is currently no software dedicated to differential expression analysis of rpvg output, what do you think would be the best software for differential expression analysis? For instance, sleuth would expect bootstrap values from kallisto and would hence not fit, but maybe something similar to tximport exists? I will look more into the pipelines for salmon and RSEM, but I would greatly appreciate any advice on this.
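To make the marginalization mentioned in the quote concrete, here is a minimal sketch that sums haplotype-specific transcript (HST) estimates into per-transcript totals. The rows are made up, and the "<transcript>_H<n>" naming and the ReadCount/TPM fields are assumptions for illustration rather than a guaranteed description of the rpvg output format.

```python
from collections import defaultdict

# Hypothetical rows mimicking an rpvg-style table: (HST name, read count, TPM).
# The "<transcript>_H<n>" naming scheme is an assumption for illustration.
hst_rows = [
    ("TX1_H1", 120.0, 30.0),
    ("TX1_H2", 80.0, 20.0),
    ("TX2_H1", 50.0, 12.5),
]


def marginalize(rows):
    """Sum haplotype-specific estimates into per-transcript totals."""
    totals = defaultdict(lambda: {"ReadCount": 0.0, "TPM": 0.0})
    for name, read_count, tpm in rows:
        # Strip the trailing haplotype suffix to recover the transcript name.
        transcript = name.rsplit("_H", 1)[0]
        totals[transcript]["ReadCount"] += read_count
        totals[transcript]["TPM"] += tpm
    return dict(totals)


transcript_totals = marginalize(hst_rows)
for transcript, values in sorted(transcript_totals.items()):
    print(transcript, values["ReadCount"], values["TPM"])
```

Summing this way discards the per-haplotype information and the associated probabilities, which is the information loss the quoted passage refers to.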
Still in the paper, it is written that: "For paired reads, these parameters are estimated from the alignment path lengths across all fragments that have 1) a mapping quality of at least 30 [...]". I think I have seen a quality threshold of around 30 mentioned elsewhere in the Methods section. I am assuming it is on a Phred scale, and I was wondering how this would behave if some or most of my reads do not have such a high value. They were trimmed to have a Phred-like score > 20, so I am wondering whether this is enough.
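For reference, a Phred-scaled score Q corresponds to an error probability of 10^(-Q/10), so 20 means 1 in 100 and 30 means 1 in 1000. The sketch below only illustrates that conversion. Note that the threshold quoted from the paper is a mapping quality (per alignment), whereas a trimming threshold is a base quality (per base), so the two values of 20 and 30 are not directly comparable; that both are Phred-scaled is assumed here, as is conventional.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred-scaled score to the corresponding error probability."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability back to a Phred-scaled score."""
    return -10 * math.log10(p)


# Compare the trimming threshold (20) with the quoted threshold (30).
for q in (20, 30):
    print(q, phred_to_error_prob(q))
```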
Finally, maybe a naive question, but is there some relation between the sum of the ReadCount column and the total number of reads mapped?
My apologies for asking so many questions at once, and thank you for taking the time to read this.
Best, Luca
I have now done several tests with rpvg, but it keeps throwing the same error:
Assertion `best_align_score <= optimal_score' failed
The exit code is 134.
It's weird because it works on the example data.
In particular, I have three .gamp files (each one a technical replicate, i.e. a different sequencing lane) for each biological replicate. Individually, some technical replicates do not throw an error, but as soon as I merge the technical replicates, it throws the same error for all samples.
I've tested whether doing the cat step before or after vg mpmap would change something, but it made no difference and I got the same error.
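For clarity, the cat step is plain byte-level concatenation of the per-lane files. A minimal Python sketch of the same operation is shown below; the file names and contents are placeholders, and whether byte-level concatenation is appropriate for a given file format is an assumption carried over from the cat usage described above.

```python
import shutil
import tempfile
from pathlib import Path


def concatenate_files(inputs, output):
    """Append the raw bytes of each input file to the output, in order (like cat)."""
    with open(output, "wb") as out_handle:
        for path in inputs:
            with open(path, "rb") as in_handle:
                shutil.copyfileobj(in_handle, out_handle)


# Demo on throwaway files standing in for per-lane outputs (hypothetical names).
with tempfile.TemporaryDirectory() as tmp:
    lanes = []
    for i, payload in enumerate([b"lane1", b"lane2", b"lane3"], start=1):
        lane = Path(tmp) / f"sample_L{i}.gamp"
        lane.write_bytes(payload)
        lanes.append(lane)
    merged = Path(tmp) / "sample_merged.gamp"
    concatenate_files(lanes, merged)
    merged_bytes = merged.read_bytes()
```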
It also does not seem to be linked to memory issues: each merged .gamp file is about 7 GB, I have allocated up to 64 GB of RAM for a single .gamp file, and the memory usage is low anyway.
Would you have any suggestions?
Hi Luca,
Re the crash. What parameters did you use for mpmap? Did you use the default scores? This error could happen if a different set of scores was used for mpmap than the default. If you used the default then I would probably need to look at the data to find the issue. Would it be possible to share the input data and one of the gamp files that crashed? You can use this email: j.a.sibbesen@gmail.com
Re the general questions. Could you create a separate issue (or multiple if you prefer) with the question(s), since they are not related to the crash? Then it would be easier for other users to find the answers if they have similar questions. Thanks!
Best,
Jonas
Hi Jonas,
It is exactly what you said! I used vg mpmap -e high, but I should actually have used -e low (i.e., the reads were trimmed such that Phred > 20, and most are > 30).
I realised this just a few hours ago, and I have re-run vg mpmap and rpvg. It has not finished yet (about 1/3 done), but it seems to be working fine. I'll close this issue and open new ones for the other questions.
Thanks!
Best, Luca
1. What were you trying to do?
Infer expression from the .gamp files produced by vg mpmap.
I have a bunch of .fastq files (replicates and different treatments), and I ran them through an array in vg mpmap. The process seemed to have worked properly (exit: 0).
I then ran some tests with rpvg using:
With xxx being different samples.
2. What actually happened?
What is weird is that it worked for all but one sample (only 4 were tested in total, since I wanted to try some things before running on the whole data set). Also, I have already re-run vg mpmap on that sample, but still got the same error.
Here is the error message I get: