Model for rpvg's output

lsoldini commented 2 years ago

Overall question

I was wondering about model choice (i.e., haplotypes, transcripts, or haplotype-transcripts).

Model choice

I have RNA-seq data for which the diplotype of each individual is known (from qPCR genotyping).

First, from my understanding, there is no way to give the diplotype of each sample as a parameter of rpvg, is that correct ? (the idea would be to kind of enforce the diplotype)

Second, would you suggest using the haplotype-transcript (since it seems more accurate due to the diplotype conditionning) or simply the transcript model (since I will not later use the diplotype inference information) ?

So far, I have used the transcript model, but I was wondering if there would be strong reason to rather use the haplotype-transcript one.

jonassibbesen commented 2 years ago

Hi,

Sorry for the delayed reply. Just got back from vacation.

Yes, it is not possible to give as input the diplotype for a sample to rpvg. But what you could do instead is to subset the pantranscriptome to only include haplotype-specific transcripts from the sample’s diplotype or create a personal transcriptome for each sample. If you do that I would recommend that you use the transcripts model. You can use the haplotype-transcripts model, but nothing will be gained by the additional diplotyping step. The haplotype-transcripts model should be used if you do not know a sample’s diplotype beforehand and you want to estimate the most likely one in the pantranscriptome.

lsoldini commented 2 years ago

I see. So, as long as I do not make us of the diplotype estimation, I may as well use the transcripts model.

Also, I was wondering about the interpretation of some of the results. For instance, between the two haplotypes, there are regions that have undergone strong differenciation while others are essentially identical. For identical regions (e.g., fully identical gene A of few kbs) and when considering one sample, we can consider the differences in read counts between A1 and A2 (i.e., two haplotype-"specific" sequences of A, which are identical in this scenario) as artifacts because it only reflects the behaviour of rpvg when confronted with equally good mapping option, would you agree ? Then, if we know that these two sequences are identical, can't we just sum up the read counts (A1+A2) ?

jonassibbesen commented 2 years ago

If the sequences are completely identical, rpvg should estimate the same read count for A1 and A2, but of course in real data the number of sequenced reads will not always be exactly the same for each haplotype. Yes, you can just sum up the read counts if the sequences are identical. Another approach would be to use a pantranscriptome without redundant sequences.

lsoldini commented 2 years ago

Thank you. I have many regions (i.e., exons, introns, whole genes in some instances) that are identical but nested within region with high divergence. Would you have any suggestion on how to build a non-redundant pantranscriptome - i.e., is there any tool that automate this ?

jonassibbesen commented 2 years ago

Did you use the -c, --do-not-collapse option in vg rna when creating the pantranscriptome? If not, then I am actually a bit surprised that your pantranscriptome contains duplicates. By default vg rna should collapse identical haplotype-specific transcripts. Would it be possible for you to share the command line and input files you used for vg rna?

lsoldini commented 2 years ago

I used vg rna -p -t 10 -n x.gff -n y.gff -s Parent -r -u -b pantranscriptome.gbwt -i pantranscriptome.txt genome_mod.pg > splicepangenome.pg.

But, sorry, I think I was not clear. I meant I have many regions identical nested within region of high divergence in my assemblies, not what I have build with vg rna.

I think I misunderstood something but it seems clearer now. but just to be sure: when two sequences are fully identical, they are collapsed with ´vg rna´, so then rpvg will infer whatever value that will be further divided by two so each identical sequence obtains the exact same amount of reads ? Then, if in my DE analysis I do not want to have redundant information (e.g., a heterozygous diploid), I can collapse the identical regions by summing the reads for the two identical features ?

I was wondering, what is the size for collapsing ? Like how many nucleotide need to be equal before collapsing occurs ?

jonassibbesen commented 2 years ago

vg rna will collapse only completely identical haplotype-specific transcripts (HSTs). One thing I forgot to mention before, however, is that vg rna will only collapse identical HSTs from the same transcript, so if the same transcript is named something different in each assembly annotation then vg rna will keep all HSTs. In your case is the name (Parent attribute) the same in x.gff and y.gff for the same transcript? If not then it will not be collapsed by vg rna.

Regarding the read counts. rpvg will estimate a read count for each HST in the pantranscriptome independent of whether it was collapsed or not. If you have HSTs (or features) that were not collapsed by vg rna which you want to combine for downstream analyses then yes you can just sum the read counts. The same goes if you want gene expression values.

If you are interested you can look at the file pantranscriptome.txt to see which HSTs were identical across assemblies. The Haplotypes column will show which haplotypes each HST was created from and if it was collapsed more than one haplotype (or assembly in your case) will be specified using comma separation.

lsoldini commented 2 years ago

So collapsing of genes is based on names rather than sequence identity ? What I mean is, vg rna is not actively looking whether sequences are identical ? No, the names (Parent) are not identical because they are form different assemblies that were annotated separately.

Ok I see. I grepped "," in my pantranscriptome.txt and there was no match, so definitely nothing has been collapsed.

But then, this should not have influenced much of my analysis. Anyway, I summed all features in which we are not interested by differences while keeping the others.

Related question, do you have any idea of what could have caused the loss of genes during the construction of the pantranscriptome ? There are 7 genes that were not incorporated while all others were. (I will do another post on vg otherwise, but maybe you know)

jonassibbesen commented 2 years ago

vg rna does collapse HSTs based on sequence identity, but it will only do so between HSTs originating from the same transcript. But I can see why in cases like yours where you have annotations from multiple assemblies this might not be the best approach. There you probably would rather want to collapse all HSTs with identical sequences independent of origin.

Not sure why the genes are missing. Would you be able to share the input data and which genes it is? Then I can take a look. You can use this mail: jonas.sibbesen@sund.ku.dk No need to also post it on the vg GitHub. I wrote vg rna so it would be me that would answer it there anyway.

lsoldini commented 2 years ago

Yes exactly. When vg rna is looking at HST reciprocal identity between haplotypes/assemblies, it is using only the exons from the pantranscriptome.txt or it also looks at introns using the .pg file ? I am considering renaming the HST which are 100% identical between both assemblies, but I am not sure which identity level (exon, whole transcripts) I should use.

Alternatively, do you think there is some other way to make vg rna collapse everything ? I was wondering whether I could use the same name for all transcripts, and then, based on length and scaffold of origin, extract those that have been collapsed and then change the name only for those ones.

I'll send you that then, thanks!

jonassibbesen commented 2 years ago

It compares the exons. More specifically, it compares the paths of the transcripts in the spliced pangenome graph and collapses transcript paths that are identical (following the same set of nodes in the same direction and order). I can't think of an easy way that would work for collapsing across all transcripts using vg rna, but I think it would be good to have it as an option. I will see if I can get time to implement it in vg rna next week. Using the same name for all transcripts would not work since the names are used to tell which exons in the annotation are from which transcripts.

Also let me know when you have shared the data and I will take a look at your missing gene issue. Thanks!

lsoldini commented 2 years ago

It would be really neat to have such an option! For instance, I guess it may then be easier to infer the haplotypes of origin (i.e., in the view that we except higher number of accumulated mutations in the intronic region compared to the exon).

Yes, sorry, it's being a bit of a rush and I have to go back to those data. I will send it no later than today.

lsoldini commented 2 years ago

Folllow-up question: I was re-reading the vg rna description; what is the advantage of using the --introns option? Sorry it was not clear to me.

Also, I was wondering how vg mpmap actually does the mapping: it also maps the reads over the introns right ? So, if there are differences in the intronic regions between the two haplotypes, it will preferentially maps over the "right" haplotype ?

And I've just sent you an email with the files for the missing genes.

jonassibbesen commented 2 years ago

vg rna would still only use the exons when comparing the transcripts, but it will not be limited to only those with the same name (origin).

Regarding the --introns option. This option can be used if you want to add introns (or splice-junctions) from a separate database to the graph as new edges.

Yes, vg mpmap will map to everything that is in the graph including introns. However, the reads that map to the introns will not be used downstream by rpvg.

Thanks for the files.

lsoldini commented 2 years ago

Ah ok. Still, it would be a good improvement when annotations are not the same!

I see, thanks.

So, rpvg cannot make use of the intronic data when inferring haplotypes? E.g., let assume that two genes are 100% identical in terms of exons, but they have differences in their introns. Can rpvg make use of "diagnostic" reads that are overlapping between exons and introns? Or it will only use the reads that are fully on exons? (I guess it is simply a matter of threshold about when a read is considered intronic or not and which rpvg can use).

jonassibbesen commented 2 years ago

Yes, rpvg uses only exonic data when inferring the haplotypes. It does allow reads to have a couple of bases at their ends mapping to introns, but it will only use the exonic part of the read for inference. However, using intronic reads for haplotype inference is definitely something that would be interesting to explore at some point.

Regarding the missing transcripts, I think I have found the issue. It is related to a rare case when the boundaries of exons are on different strands, that is one is forward and the other reverse. vg rna is currently not able to handle those cases and the transcripts are therefore filtered. It is actually mostly an issue when projecting (liftover) transcripts between haplotypes which is not what you are doing. I am working on a fix so that it should work in your case where you have an annotation for each haplotype.

lsoldini commented 2 years ago

I am looking forward to a new feature using intronic reads, but it the meantime it's already great to have rpvg to infer expression from vg mpmap! I would really have been stuck without it.

Thanks for working on a fix. Will this also include the ability to collapse transcripts regardless of their name ? It would be really interesting (and, I think, having different annotation may become increasingly common - e.g., I already have another project for which I'll use vg toolkit, and it implies two assemblies as well).

lsoldini commented 2 years ago

Hello Jonas, as a small follow-up, I was wondering if you had any time to work on the above?

jonassibbesen commented 2 years ago

Hi, sorry for the silence. Due to illness and paper revisions I have not been able to work as much on this as I had hoped. I hope to get back to it next week.

lsoldini commented 2 years ago

Okay, thanks for the heads up, and I hope you recovered well.

jonassibbesen commented 2 years ago

Hi, I now have a PR that fixes the missing gene problem. There is currently an issue with the CI so it might be a day or two before I can merge it. I will keep you updated.

I will work on the ability to collapse identical transcripts regardless of name next.

lsoldini commented 2 years ago

Hi Jonas, great! Thank you for the update and for the patch. Looking forward to it!

jonassibbesen commented 2 years ago

I have just merged the fix for the missing genes.

lsoldini commented 2 years ago

Nice, thank you! I'll give it a try asap.

lsoldini commented 2 years ago

It tooked me more time than expected to get it worked (i.e., vg compilation on cluster), but it does now. Thanks!

Do you think you'll soon get the ability to collapse regardless of names done?

jonassibbesen commented 1 year ago

Yes, I am almost done. I ran into some unforeseen issues, which took longer to fix than expected. I expect to have it ready start of next week.

lsoldini commented 1 year ago

Great! Thanks, looking forward to it.

jonassibbesen commented 1 year ago

Hi, I finally had time to finish the new code for collapsing identical transcripts. I have just merged it into vg. You can enable the new option to collapse across all paths (including different transcripts) using -c all in vg rna.

lsoldini commented 1 year ago

Hi Jonas, great, thank you for making this new feature! I'll give it a try asap.

Just wondering, how does it work practically ? Each transcript path of haplotype A in the splicepangenome graph is compared against all transcript paths of haplotype B, and if they are 100% identical, those are collapsed ?

jonassibbesen commented 1 year ago

Sounds good. Let me know if you run into any problems.

In the new -c all option each transcript path is actually compared against all other transcript paths (including paths on the same haplotype). And yes then if two paths are 100% identical they are collapsed into one. Information on which transcripts were collapsed can be found in the transcript info table written using -i.

lsoldini commented 1 year ago

Okay, even better then (the comparison against path within the same haplotype)!

I'll let you know how this works out.

lsoldini commented 1 year ago

It seems to work nicely! I have 4175 occurences within my pantranscriptome.txt file (values very similar to what was obtained in another attempt to merge transcriptomes for use with kallisto). I have not finished the whole pggb to rpvg pipeline, but at least this part is promising. Thanks again!

lsoldini commented 1 year ago

Hi Jonas, I wanted to come back to something you said at the beginning of this issue:

But what you could do instead is to subset the pantranscriptome to only include haplotype-specific transcripts from the sample’s diplotype or create a personal transcriptome for each sample.

What would be the best between (i) modifying the pantranscriptome.txt (the one going into rpvg) dependending on each sample genotype ? (ii) running vg rna with transcriptome specific to each sample ?

I am thinking this may not lead to the same output (e.g., different vg mpmap output if the transcriptomes are not the same), but I would rather go with (i) which seems easier than re-doing the mapping etc. However, I'll go with the later if it's worth.

jonassibbesen commented 1 year ago

Hi, sorry for the delayed answer. I somehow missed your comment.

The GBWT index with transcript paths and the transcript info file have to match so option (i) is not possible. What you could do is to subset both the pantranscriptome GBWT file and info txt file. You can use vg gbwt -R for removing paths in a GBWT. The advantage of option (ii) is that you will have a personal pangenome graph for each sample, which might improve mapping. Maybe you could try it with a single sample to see if it helps before doing it for all samples.

lsoldini commented 1 year ago

Hi, no worries, and thanks for the reply. In the meantime, I have been re-thinking about my initial will to specify the haplotypes for each sample.

The goal is ultimately to perform DE analysis, and I am wondering whether this would actually make sense to perform DE (e.g., with edgeR) on rpvg output from samples mapped on different pantranscriptome ? This behaviour could be forced, but it seems wrong.

Second, if it were possible to do such DE analysis, wouldn't the output be biased ? My point is that I have two assemblies (A and B), and samples of AA, AB, and BB individuals. Some genomic features of both A and B assemblies are truly specific, whereas some other are artefacts (i.e., some SNPs found in assembly A but absent from assembly B are actually found in BB individuals). These artefacts are expected from the biological perspective (i.e., there is some gene flow between A and B). It seems that, by doing specific pangenome for each combination, I will have lower mapping in AA and BB individuals.

jonassibbesen commented 1 year ago

If you want to do DE analyses I also think that it would be better to use the same pantranscriptome for all samples.

Yes, it could be a problem if you expect that, for example, some of assembly A is present in the BB samples. If that is the case it would be better to create a pangenome from all assemblies and use that for all samples.

lsoldini commented 1 year ago

Thanks for the answer. Great, I will then continue with one pan-genome/-transcriptome for all. I think this is approaching to something quite concrete (I just have an ongoing issue with pggb graph complexity, some samples require to much memory and never finish running -but it's already on vg github).

lsoldini commented 1 year ago

Hello Jonas, Just a quick check that I got rpvg output right: the last line of rpvg output ("Uknown") corresponds to reads that were not mapped, and if this number is summed with read counts for each transcript, this gives the number of alignments in total (i.e., including both primary and secondary aligments). Would there be anyway to have more information on how many reads were uniquely mapped, non-mapped, etc. ? (I am comparing with kallisto output to get an idea of how better the vg mpmap | rpvg was)

jonassibbesen commented 1 year ago

Hi,

The “Unknown” transcript is an artificial noise transcript. The read count for this transcript includes both the unmapped reads and reads assigned to it during expression inference because of low mapping quality.

The “ReadCount” column in the output corresponds to fragments for paired-end input and reads for single-end. When you sum the "ReadCount" column it gives you the total number of unique reads/fragments in the input alignment file, which should correspond to the total number of sequenced reads/fragments

This information should probably have been part of the readme. Sorry about that.

Regarding getting statistics on the alignments. That is not possible using rpvg and I am not sure whether it is possible to currently get that for mpmap alignments using vg. My recommendation would be that you ask about it on the vg github.

lsoldini commented 1 year ago

Hi Jonas, thanks for your answer -sorry for my delayed answer. Great, so it kind of matches. Yes I already have an issue (#3672) on the vg github, but this may take longer to get fixed because I used pggb and it creates issues with vg surject and other commands to get stats on mapping... in the meantime I'll go with this!

jonassibbesen / rpvg

Model for rpvg's output #38