vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/
Other
1.12k stars 194 forks source link

Running mpmap then rpvg #4116

Open brettChapman opened 1 year ago

brettChapman commented 1 year ago

Hi

I have a pangenome graph with 5 genomes generated from minigraph-cactus.

I'm currently indexing with autoindex to use mpmap and rpvg.

When aligning RNA-seq reads should I align each set of reads from each sample, create multiple GAM files, then concatenate the GAM files, then convert to GAMP according to this wiki: https://github.com/vgteam/vg/wiki/Multipath-alignments-and-vg-mpmap

And then run rpvg? Thanks.

brettChapman commented 1 year ago

I'm finding the memory requirements for vg autoindex pushing the system requirements a bit. Running on a 1TB system I get a signal 9 error.

Would it be feasible to run on each chromosome graph separately, mpmap the RNA-seq data to each chromosome, concatenate the GAM files across all samples and chromosomes, convert to GAMP, then run rpvg?

brettChapman commented 1 year ago

Here is the error I get:

[vg autoindex] Executing command: /vg/bin/vg autoindex -t 16 -w mpmap -w rpvg -p barley_splice_pangenome_index -T /data/murdoch_wcgapangenome/minigraph-cactus_splice_graph/tmp -g /data/murdoch_wcgapangenome/minigraph-cactus_splice_graph/barley-pg/barley-pg.gfa -x /data/murdoch_wcgapangenome/minigraph-cactus_splice_graph/pangenome_reference.gtf -H /data/murdoch_wcgapangenome/minigraph-cactus_splice_graph/pangenome_haplotype.gtf
[IndexRegistry]: Checking for haplotype lines in GFA.
[IndexRegistry]: Constructing a GBZ from GFA input.
[IndexRegistry]: Constructing haplotype-transcript GBWT and spliced graph from GBZ-format graph.
    WARNING: Excluded 20245 transcripts with exon overlapping a haplotype break.
[IndexRegistry]: Joining transcript origin table.
[IndexRegistry]: Constructing spliced XG graph from spliced VG graph.
[IndexRegistry]: Constructing distance index for a spliced graph.
[IndexRegistry]: Pruning complex regions of spliced VG to prepare for GCSA indexing.
[IndexRegistry]: Constructing GCSA/LCP indexes.
PathGraphBuilder::write(): Size limit exceeded, construction aborted
[IndexRegistry]: Exceeded disk use limit while performing k-mer doubling steps. Rewinding to pruning step with more aggressive pruning to simplify the graph.
[IndexRegistry]: Pruning complex regions of spliced VG to prepare for GCSA indexing.
[IndexRegistry]: Constructing GCSA/LCP indexes.
error:[IndexRegistry] Child process 19852 signaled with status 9 representing signal 9
/var/spool/slurmd/job50527817/slurm_script: line 31: 52722 Killed                  singularity exec --bind ${PWD}:${PWD} $container vg autoindex -t 16 -w mpmap -w rpvg -p barley_splice_pangenome_index -T ${PWD}/tmp -g ${PWD}/barley-pg/barley-pg.gfa -x ${PWD}/pangenome_reference.gtf -H ${PWD}/pangenome_haplotype.gtf
jeizenga commented 1 year ago

For rpvg, you would really want the GAMP to be created natively as GAMP, not as GAM. If you plan to use vg mpmap, then GAMP is the default output, so there's no extra work. The conversion of GAM -> GAMP is lossless, but the output is only nominally "multipath", as every read alignment will consist of a single path. You can run rpvg on GAM input, but there is a modest hit to accuracy.

Unfotunately, you cannot really map separately to chromosomes and then combine the results. Technically, it's possible. However, some parts of the mapping algorithm assume that it can simultaneously view all of the high-scoring alignments, and they won't produce accurate results if this assumption is broken. For instance, you will have completely meaningless MAPQs, because each individual chromosome's run can't tell if there's another plausible mapping on a different chromosome. rpvg directly uses the MAPQs, so that's would probably be a significant problem.

Do you know for sure that this job was killed for memory use? That step is typically heavy on use of hard disk, but not especially RAM hungry.

brettChapman commented 1 year ago

Thanks @jeizenga

I'll look into the error a bit more and see if it was because of memory or disk usage. I did notice the tmp directory had very large files generated. I'm running on a cluster which is not managed by myself, so I've asked the admins to check.

brettChapman commented 1 year ago

In regards to running rpvg. Can I concatenate multiple GAMP files together? Would I run a different vg mpmap step for each RNA-seq sample, or should I simply concatenate all the Fastq files together from the 5 genomes prior to mapping to the 5 genome splice graph? I have RNA-seq data sequenced from each of the represented genomes in the graph.

jeizenga commented 1 year ago

Both GAM and GAMP files can be combined with cat. You could also combine the input FASTQs, and the results will be valid. You will probably get marginally better results running the FASTQs separately and then combining them post hoc since that will allow vg mpmap to estimate the fragment length distribution separately for each library. You might also be interested in annotating the reads with the --sample or --read-group options.

brettChapman commented 1 year ago

Hi @jeizenga it's been comfirmed that my last job died from OOM killer, so it was a memory limit issue. I'm now running with 2TB RAM and it seems to be getting further along, however it looks like it's going through iterative steps of pruning because the size limit keeps being exceeded. Is this size limit the index size? Can the size limit be increased? Will the pruning eventually be enough and the job continue on or could this loop of pruning never end? Thanks.

[vg autoindex] Executing command: /vg/bin/vg autoindex -t 32 -w mpmap -w rpvg -p barley_splice_pangenome_index -T /data/murdoch_wcgapangenome/minigraph-cactus_splice_graph/tmp -g /data/murdoch_wcgapangenome/minigraph-cactus_splice_graph/barley-pg/barley-pg.gfa -x /data/murdoch_wcgapangenome/minigraph-cactus_splice_graph/pangenome_reference.gtf -H /data/murdoch_wcgapangenome/minigraph-cactus_splice_graph/pangenome_haplotype.gtf
[IndexRegistry]: Checking for haplotype lines in GFA.
[IndexRegistry]: Constructing a GBZ from GFA input.
[IndexRegistry]: Constructing haplotype-transcript GBWT and spliced graph from GBZ-format graph.
    WARNING: Excluded 20245 transcripts with exon overlapping a haplotype break.
[IndexRegistry]: Joining transcript origin table.
[IndexRegistry]: Constructing spliced XG graph from spliced VG graph.
[IndexRegistry]: Constructing distance index for a spliced graph.
[IndexRegistry]: Pruning complex regions of spliced VG to prepare for GCSA indexing.
[IndexRegistry]: Constructing GCSA/LCP indexes.
PathGraphBuilder::write(): Size limit exceeded, construction aborted
[IndexRegistry]: Exceeded disk use limit while performing k-mer doubling steps. Rewinding to pruning step with more aggressive pruning to simplify the graph.
[IndexRegistry]: Pruning complex regions of spliced VG to prepare for GCSA indexing.
[IndexRegistry]: Constructing GCSA/LCP indexes.
PathGraphBuilder::write(): Size limit exceeded, construction aborted
[IndexRegistry]: Exceeded disk use limit while performing k-mer doubling steps. Rewinding to pruning step with more aggressive pruning to simplify the graph.
[IndexRegistry]: Pruning complex regions of spliced VG to prepare for GCSA indexing.
[IndexRegistry]: Constructing GCSA/LCP indexes.
brettChapman commented 1 year ago

Hi @jeizenga

It looks like the job finished. I received no more errors.

I have these output files:

barley_splice_pangenome_index.haplotx.gbwt
barley_splice_pangenome_index.txorigin.tsv
barley_splice_pangenome_index.spliced.xg
barley_splice_pangenome_index.spliced.dist
barley_splice_pangenome_index.spliced.gcsa
barley_splice_pangenome_index.spliced.gcsa.lcp

Are these all the files expected? I received no output to standard out saying the indexing had completed, apart from the pruning and constructing the GCSA/LCP as mentioned above.

jeizenga commented 1 year ago

If the .gcsa and .gcsa.lcp files aren't empty, then I believe it exited correctly. Did you give it a higher memory limit this time?

brettChapman commented 1 year ago

Yes, the gcsa and gcsa.lp files are large and not empty.

I used a 2TB memory node this time.

jeizenga commented 1 year ago

It's surprising to me that it would take so much memory in that step. I'll need to ask the author of that component why that might happen.

brettChapman commented 1 year ago

Ok thanks. I do have 5 genomes in the graph, and each genome is around 4 to 5Gbp in size.

Ideally we'd like to include all the genomes in our study, but given how difficult its been even with 5, I think it might be a while before pangenome indexing of that size is possible.

We've had to out source to a different compute service with enough resources to get the indexing done. Our usual available resources weren't enough.

jeizenga commented 1 year ago

How large did the GCSA and LCP files end up being? Their on-disk size is roughly the same as their in-memory size, so the answer has some bearing on what solutions to the memory use produce indexes that are usable in practice.

brettChapman commented 1 year ago

Hi @jeizenga

They're a decent size, with GCSA being 27G and LCP being 17GB:

1006M Oct 10 11:16 barley_splice_pangenome_index.haplotx.gbwt
31M Oct 10 11:16 barley_splice_pangenome_index.txorigin.tsv
17G Oct 10 11:38 barley_splice_pangenome_index.spliced.xg
11G Oct 10 12:37 barley_splice_pangenome_index.spliced.dist
27G Oct 12 18:06 barley_splice_pangenome_index.spliced.gcsa
17G Oct 12 18:06 barley_splice_pangenome_index.spliced.gcsa.lcp

Would index sizes like these be impracticable for use in practice?

I'm currently running mpmap with them and generating GAMP files. So far it's running. I'll then merge the GAMP. I have 3 different replicates for the RNA-seq reads, so I'll run rpvg separate for each replicate when looking at the quantification to get an idea on reproducibility. I was thinking of merging all replicates, but it might be better to run rpvg separate for each to see reproducibility.

jeizenga commented 1 year ago

Okay, that's definitely larger than usual but not deadly. Usually, the GCSA and LCP would be maybe 10-15 GB combined, but you can still use those in a server environment. I think the large indexes and high memory use probably have the same underlying cause, and they probably could both be solved by stricter pruning. I just need to figure out how to detect that behavior at run time.

brettChapman commented 1 year ago

Hi @jeizenga

I'm now running rpvg. All running ok so far.

Perhaps for autoindex if the user could specify a level on pruning strictness depending on their available resources? And maybe provide a report at the end indicating a recommendation to rerun again if resulting GCSA and LCP is too large, and suggest stricter pruning?

When I was running this on a more limited system I didn't even get to the pruning stage. Generating the splice graph and initial GCSA was too memory intensive to get this far. I had been running for weeks with no progress.

On this other resource we've paid for out of pocket (not government funded), we've managed to get the entire workflow done within a week. There is definitely a lot of IO overhead and in memory processing needed which pushed our other compute resources to their limit.

brettChapman commented 1 year ago

Hi @jeizenga

My rpvg runs have completed and I have the result files. I'm a bit confused on how to interpret.

I ran this command:

rpvg -t 32 -g ${GRAPH_XG} -p ${GRAPH_GBWT} -f ${TRANSCRIPT_ORIGIN} -a ${tissue}.gamp -o rpvg_${tissue}_results -i haplotype-transcripts

From the 3 genomes in the splice graph I have 3 RNA-seq datasets in triplicate I mapped using mpmap. I merged all the GAMP files together for each of the genomes and then ran rpvg. I have a result for each of the 3 replicates.

I get output like this:

Name    ClusterID       Length  EffectiveLength HaplotypeProbability    ReadCount       TPM
HORVU.AKASHINRIKI.CNS.4HG00302460.1_R1  1       994     739.03959       1       476508.29       4545.7705
HORVU.FT11.CNS.4HG00305750.1_R1 1       1000    745.03959       0       0       0
HORVU.HOR_13942.CNS.4HG00297900.1_R1    1       1043    788.03959       0       0       0
HORVU.MOREX.CNS.4HG00296860.1_R1        1       1158    903.03959       1       547351.97       4273.3124
HORVU.MOREX.CNS.4HG00296870.1_R1        1       838     583.03959       1       1157.6261       13.99829
HORVU.MOREX.CNS.4HG00296880.1_R1        1       1517    1262.0396       1       29.303144       0.16369899
HORVU.RGT_PLANET.CNS.4HG00301670.1_R1   1       1110    855.03959       0       0       0
HORVU.AKASHINRIKI.CNS.3HG00251930.1_R1  689     1416    1161.0396       1       95074.352       577.32597
HORVU.FT11.CNS.3HG00255940.1_R1 689     1416    1161.0396       1       22892.234       139.00995
HORVU.HOR_13942.CNS.3HG00248820.1_R1    689     1756    1501.0396       0       0       0
HORVU.MOREX.CNS.3HG00248620.1_R1        689     1585    1330.0396       0       0       0
HORVU.RGT_PLANET.CNS.3HG00251780.1_R1   689     1700    1445.0396       0       0       0
HORVU.AKASHINRIKI.CNS.7HG00578810.1_R1  715     3753    3498.0396       0       0       0
HORVU.FT11.CNS.7HG00587830.1_R1 715     3394    3139.0396       1       20492.958       46.026994
HORVU.HOR_13942.CNS.7HG00573720.1_R1    715     3731    3476.0396       0       0       0
HORVU.MOREX.CNS.7HG00573230.1_R1        715     3415    3160.0396       1       26294.089       58.663823
HORVU.RGT_PLANET.CNS.7HG00580230.1_R1   715     2094    1839.0396       0       0       0
HORVU.RGT_PLANET.CNS.7HG00580240.1_R1   715     660     405.03959       0       0       0
HORVU.AKASHINRIKI.CNS.1HG00017190.1_R1  766     4599    4344.0396       1       14991.627       24.330988
HORVU.HOR_13942.CNS.1HG00016970.1_R1    766     4695    4440.0396       1       26728.925       42.442348

And the joint output file:

Name_1  Name_2  ClusterID       HaplotypingProbability  ReadCount_1     TPM_1   ReadCount_2     TPM_2
HORVU.AKASHINRIKI.CNS.4HG00302460.1_R1  .       1       1       476508.29       4545.7705       0       0
HORVU.MOREX.CNS.4HG00296860.1_R1        .       1       1       547351.97       4273.3124       0       0
HORVU.MOREX.CNS.4HG00296870.1_R1        .       1       1       1157.6261       13.99829        0       0
HORVU.MOREX.CNS.4HG00296880.1_R1        .       1       1       29.303144       0.16369899      0       0
HORVU.AKASHINRIKI.CNS.3HG00251930.1_R1  .       689     1       95074.352       577.32597       0       0
HORVU.FT11.CNS.3HG00255940.1_R1 .       689     1       22892.234       139.00995       0       0
HORVU.FT11.CNS.7HG00587830.1_R1 .       715     1       20492.958       46.026994       0       0
HORVU.MOREX.CNS.7HG00573230.1_R1        .       715     1       26294.089       58.663823       0       0
HORVU.AKASHINRIKI.CNS.1HG00017190.1_R1  .       766     1       14991.627       24.330988       0       0
HORVU.HOR_13942.CNS.1HG00016970.1_R1    .       766     1       26728.925       42.442348       0       0
HORVU.FT11.CNS.5HG00434380.1_R1 .       822     1       26402.199       114.12488       0       0
HORVU.RGT_PLANET.CNS.5HG00428540.1_R1   .       822     1       6519.3843       25.548806       0       0
HORVU.AKASHINRIKI.CNS.4HG00352720.1_R1  .       862     1       35874.767       294.0867        0       0
HORVU.MOREX.CNS.4HG00347720.1_R1        .       862     1       14519.975       121.71783       0       0
HORVU.AKASHINRIKI.CNS.3HG00228650.1_R1  .       895     1       11778.602       21.87023        0       0
HORVU.HOR_13942.CNS.3HG00225550.1_R1    .       895     1       19248.84        38.696229       0       0
HORVU.HOR_13942.CNS.2HG00164950.1_R1    .       944     1       3036.5661       12.927566       0       0
HORVU.MOREX.CNS.2HG00164450.1_R1        .       944     1       49.950499       0.21407612      0       0
HORVU.MOREX.CNS.2HG00164460.1_R1        .       944     1       15991.333       75.766101       0       0
HORVU.HOR_13942.CNS.2HG00104200.1_R1    .       992     1       8834.0265       43.130485       0       0
HORVU.RGT_PLANET.CNS.2HG00105220.1_R1   .       992     1       19046.574       78.849125       0       0
HORVU.AKASHINRIKI.CNS.3HG00225490.1_R1  .       1039    1       15592.037       182.89614       0       0
HORVU.HOR_13942.CNS.3HG00222080.1_R1    .       1039    1       3549.8193       39.534853       0       0
HORVU.AKASHINRIKI.CNS.6HG00496840.1_R1  .       1075    1       1087.0563       9.5318971       0       0
HORVU.AKASHINRIKI.CNS.6HG00496850.1_R1  .       1075    1       11333.622       16.191342       0       0

What do the '_1' and '_2' represent in the joint file, two different calculations?

Is every transcript specific and unique to a single haplotype? I'm trying to make sense on how to interpret the results. Thanks.

brettChapman commented 1 year ago

I was thinking, should I do the quantification using rpvg on each aligned RNA-seq sample, or after I've merged the 5 different RNA-seg alignment GAMP files?

We have TPM values for conventional STAR and SALMON alignment and quantification to each individual genome, and I'm trying to compare the TPM values between convention approach and the haplotype-aware approach. Can they be compared?

brettChapman commented 1 year ago

Is there any advantage to using the haplotype-aware splice graph compared to mapping RNA-seq reads derived from the same species/strain to the same species/strain? We have 5 different genomes, with 5 different RNA-seq datasets derived from each of the 5 different genomes. You would expect that say Morex reads would align best to Morex genome in this example.

brettChapman commented 1 year ago

I have found that by merging the GAMP files and doing rpvg quantification, I can identify some uniquely expressed genes from the 5 genomes from that tissue. To do this I parse the results for each clusterID, look for any non zero TPM values and count the number of transcripts in that cluster.

jeizenga commented 1 year ago

Sorry, it looks like I missed your message from a few days ago.

What do the '_1' and '_2' represent in the joint file, two different calculations?

I believe these refer to the two haplotypes of this transcript. If the second transcript is null, then presumably rpvg is estimating the transcript to be homozygous.

Is every transcript specific and unique to a single haplotype?

More or less, yes. vg rna and rpvg consider haplotypes that only differ by intronic variants to be the same (because these differences don't show up in RNA-seq).

We have TPM values for conventional STAR and SALMON alignment and quantification to each individual genome, and I'm trying to compare the TPM values between convention approach and the haplotype-aware approach. Can they be compared?

To directly compare them, you would have to sum over the haplotypes to derive gene-level expression values.

Is there any advantage to using the haplotype-aware splice graph compared to mapping RNA-seq reads derived from the same species/strain to the same species/strain?

As long as you have annotations for all of them, the best reference for a sample is that sample's own genome. If you have a diploid species, then "own genome" is already plural, but a pangenome of its own two haplotypes would then be best. If you have a collapsed, haploid assembly for a diploid species, you could see some benefit from adding other assemblies, which might contain some of the collapsed variation. In general, pangenomes are most valuable when you know a lot about a population but not very much about the sample.

I have found that by merging the GAMP files and doing rpvg quantification, I can identify some uniquely expressed genes from the 5 genomes from that tissue.

Cool!

brettChapman commented 1 year ago

Hi @jeizenga

Thanks for the explanation.

Is there a way to get the mapping rates from mpmap? I'd like to identify what percentage of reads that were unmapped to compare to mapping rates of conventional approaches. Also can we identify the number of multi mapping reads and single mapping reads? Thanks.

brettChapman commented 1 year ago

I notice some TPM values to be quite low, and the corresponding read count to be very low. How are the read counts determined, and ideally should I apply a minimum cut off for read count before considering a transcript as being expressed?

brettChapman commented 1 year ago

I've noticed the TPM values in the joint file TPM_1 and TPM_2 where they aren't zero values, they're actually the exact same TPM and read count, which means I'm getting the exact TPM for each suspected heterozygous allele. Is this expected? My assemblies are collapsed primary assemblies, and are not haplotype resolved assemblies. I assume the RNA-seq picks up on the difference at the transcript level, but because the genome isn't haplotype resolved I'm getting the exact same expression values.

jeizenga commented 1 year ago

I notice some TPM values to be quite low, and the corresponding read count to be very low. How are the read counts determined, and ideally should I apply a minimum cut off for read count before considering a transcript as being expressed?

Generally speaking, TPM are more useful values than read counts. The read counts are also affected by the length of the transcript and the fragment length distribution. TPM are comparable across transcripts of different lengths and across different sequencing runs. Regarding the minimum, there already is one being applied. rpvg follows a convention in this field that TPM < 10^-8 is equivalent to not being expressed.

I've noticed the TPM values in the joint file TPM_1 and TPM_2 where they aren't zero values, they're actually the exact same TPM and read count, which means I'm getting the exact TPM for each suspected heterozygous allele. Is this expected?

My guess is that if the values are exactly the same, then either a) there is no variant distinguishing them or b) there were no reads containing the variant distinguishing them. I actually don't know off the top of my head how vg rna decides whether to merge or not merge two transcripts that are provided as reference annotations (i.e. not projected onto haplotypes). Maybe @jonassibbesen could weigh in on this point?

brettChapman commented 1 year ago

Hi @jeizenga thanks for the explanation.

I also notice Name_1 and Name_2 are exactly the same transcript. I thought perhaps the reads were picking up on differences for these transcripts, and identifying that these transcripts may contain heterozygous alleles, even though the TPM values are also exactly the same.

jeizenga commented 1 year ago

Can you share an example or two?

brettChapman commented 1 year ago

Here is an example of the joint results. You can see some values for TMP_2 are zero, and others have values, but _1 and _2 exactly the same. I wrote an awk script to check if they are exactly the same in every case, and they are.

HORVU.AKASHINRIKI.CNS.3HG00267660.1_R1  .       12895   0.01810418      11.728003       0.059915234     0       0
HORVU.HOR_13942.CNS.3HG00263530.1_R1    .       12895   0.0031012251    2.0084581       0.010231022     0       0
HORVU.MOREX.CNS.3HG00263830.1_R1        .       12895   0.97879459      634.45028       3.2625137       0       0
HORVU.RGT_PLANET.CNS.3HG00267280.1_R1   .       12895   1       322.67459       1.6448826       0       0
HORVU.MOREX.CNS.5HG00446250.1_R1        .       13039   1       298.05807       1.666391        0       0
HORVU.RGT_PLANET.CNS.5HG00450430.1_R1   .       13039   1       461.93613       2.482217        0       0
HORVU.FT11.CNS.5HG00388160.1_R1 .       13288   1       324.26428       1.1108365       0       0
HORVU.HOR_13942.CNS.5HG00378360.1_R1    .       13288   1       1040.7175       9.8482318       0       0
HORVU.FT11.CNS.6HG00495640.1_R1 .       13615   1       269.30481       1.9943154       0       0
HORVU.HOR_13942.CNS.6HG00485000.1_R1    .       13615   0.9638288       277.26906       1.599635        0       0
HORVU.MOREX.CNS.6HG00482800.1_R1        .       13615   0.036171202     10.401117       0.059279037     0       0
HORVU.AKASHINRIKI.CNS.4HG00303720.1_R1  .       13908   1       301.12783       0.63984396      0       0
HORVU.FT11.CNS.4HG00306910.1_R1 .       13908   1       347.86597       0.75999791      0       0
HORVU.MOREX.CNS.6HG00557450.1_R1        HORVU.MOREX.CNS.6HG00557450.1_R1        14442   1       96.121123       0.41472576      96.121123       0.41472576
HORVU.MOREX.CNS.6HG00557440.1_R1        HORVU.MOREX.CNS.6HG00557440.1_R1        14442   1       180.87518       0.67009425      180.87518       0.67009425
HORVU.FT11.CNS.5HG00449630.1_R1 .       14845   1       239.69594       0.9552738       0       0
HORVU.RGT_PLANET.CNS.5HG00443360.1_R1   .       14845   1       295.05617       0.95246482      0       0
HORVU.MOREX.CNS.2HG00147250.1_R1        HORVU.MOREX.CNS.2HG00147250.1_R1        15074   1       43.333178       0.12195811      43.333178       0.12195811
HORVU.MOREX.CNS.2HG00147240.1_R1        HORVU.MOREX.CNS.2HG00147240.1_R1        15074   1       283.16  1.5547417       283.16  1.5547417
HORVU.FT11.CNS.3HG00279880.1_R1 .       15254   1       199.64723       1.0326653       0       0
HORVU.RGT_PLANET.CNS.3HG00274730.1_R1   .       15254   1       262.34516       1.7531095       0       0
HORVU.FT11.CNS.6HG00491940.1_R1 .       15456   1       174.42454       0.70835785      0       0
HORVU.MOREX.CNS.6HG00479350.1_R1        .       15456   1       217.56085       1.0064432       0       0
HORVU.AKASHINRIKI.CNS.2HG00181600.1_R1  HORVU.AKASHINRIKI.CNS.2HG00181600.1_R1  15760   1       1005.8775       10.044326       1005.8775       10.044326
brettChapman commented 1 year ago

I've had a look at differences between the joint file and the other output file. There is some overlap with transcripts and TPM values. Is it safe to merge these files together? or should they be handled completely separate?

Below I have merged the columns. Some transcripts are repeated withe exact TPM values, and other have slightly different TPM values, I assume because these differences are attributed to the transcript having an alternative allele and is heterogenous. For example HORVU.AKASHINRIKI.CNS.1HG00000760.1 has a different TPM in joint vs other output file.

HORVU.AKASHINRIKI.CNS.1HG00000060.1_R1  9949    0       0       0
HORVU.AKASHINRIKI.CNS.1HG00000070.1_R1  3240    0       0       0
HORVU.AKASHINRIKI.CNS.1HG00000080.1_R1  3856    0       0       0
HORVU.AKASHINRIKI.CNS.1HG00000090.1_R1  3337    0       0       0
HORVU.AKASHINRIKI.CNS.1HG00000100.1_R1  3577    0       0       0
HORVU.AKASHINRIKI.CNS.1HG00000110.1_R1  1259    0       0       0
HORVU.AKASHINRIKI.CNS.1HG00000120.1_R1  6241    0       0       0
HORVU.AKASHINRIKI.CNS.1HG00000130.1_R1  29621   0.028664199     0       0
HORVU.AKASHINRIKI.CNS.1HG00000130.1_R1  29621   0.028664199     0       0
HORVU.AKASHINRIKI.CNS.1HG00000140.1_R1  1114    0       0       0
HORVU.AKASHINRIKI.CNS.1HG00000150.1_R1  201483  0       0       0
HORVU.AKASHINRIKI.CNS.1HG00000160.1_R1  491     0       0       0
HORVU.AKASHINRIKI.CNS.1HG00000170.1_R1  6110    0       0       0
HORVU.AKASHINRIKI.CNS.1HG00000180.1_R1  201482  0       0       0
HORVU.AKASHINRIKI.CNS.1HG00000190.1_R1  201481  0       0       0
HORVU.AKASHINRIKI.CNS.1HG00000680.1_R1  6673    0       0       0
HORVU.AKASHINRIKI.CNS.1HG00000690.1_R1  1707    0       0       0
HORVU.AKASHINRIKI.CNS.1HG00000700.1_R1  9832    0       0       0
HORVU.AKASHINRIKI.CNS.1HG00000710.1_R1  201480  0       0       0
HORVU.AKASHINRIKI.CNS.1HG00000720.1_R1  201479  0       0       0
HORVU.AKASHINRIKI.CNS.1HG00000730.1_R1  201478  0       0       0
HORVU.AKASHINRIKI.CNS.1HG00000740.1_R1  201477  0       0       0
HORVU.AKASHINRIKI.CNS.1HG00000750.1_R1  201476  0       0       0
HORVU.AKASHINRIKI.CNS.1HG00000760.1_R1  20471   1       29      0.46102147
HORVU.AKASHINRIKI.CNS.1HG00000760.1_R1  20471   1       58      0.92204293
HORVU.AKASHINRIKI.CNS.1HG00000770.1_R1  2809    0       0       0
HORVU.AKASHINRIKI.CNS.1HG00000780.1_R1  2809    0       0       0
HORVU.AKASHINRIKI.CNS.1HG00000790.1_R1  1380    0       0       0
HORVU.AKASHINRIKI.CNS.1HG00000800.1_R1  186     0       0       0
HORVU.AKASHINRIKI.CNS.1HG00000810.1_R1  201475  0       0       0
HORVU.AKASHINRIKI.CNS.1HG00000820.1_R1  1918    0       0       0
HORVU.AKASHINRIKI.CNS.1HG00000830.1_R1  201474  0       0       0
HORVU.AKASHINRIKI.CNS.1HG00000840.1_R1  4514    0       0       0
HORVU.AKASHINRIKI.CNS.1HG00000850.1_R1  19547   0       0       0
HORVU.AKASHINRIKI.CNS.1HG00000860.1_R1  201473  0       0       0
HORVU.AKASHINRIKI.CNS.1HG00000870.1_R1  28384   0.37608521      0.94450749      0.021338014
HORVU.AKASHINRIKI.CNS.1HG00000870.1_R1  28384   0.37608521      0.94450749      0.021338014
brettChapman commented 1 year ago

Another example

From rpvg_joint.txt file:

HORVU.AKASHINRIKI.CNS.1HG00002340.1_R1  14798   0.25    17.624975       1.018912
HORVU.AKASHINRIKI.CNS.1HG00002340.1_R1  14798   0.5     35.2499 2.0378212

From rpvg.txt file:

HORVU.AKASHINRIKI.CNS.1HG00002340.1_R1  14798   0.75    70.49985        4.0756453

I notice here the joint file has 2 entries for same transcript first line is from where it has TPM values for TPM_1 and TPM_2 exactly the same, the second line only TPM_1 is filled, zero for TPM_2. Also two different haplotype probabilities for same transcript.

In the rpvg.txt file we have same transcript, its the sum of the haplotype probabilities in the joint file, but the TPM values dont sum up to match the joint file. Based on this, perhaps its best to treat rpvg.txt and rpvg_joint. results separately? Is one more accurate than the other? How should I make use of each result?

jeizenga commented 1 year ago

I believe the rpvg.txt is what you get if you sum up the TPM values for that transcript across every joint pair that it's included in from rpvg_joint.txt, so rpvg.txt is informational redundant with rpvg_joint.txt but not the other way around. It seems strange to me that you are seeing multiple reference transcripts with the same ID. Are there multiple copies of it in your input annotations?

brettChapman commented 1 year ago

To be sure, I checked the input haplotype GTF file and for transcript HORVU.AKASHINRIKI.CNS.1HG00002340.1 mentioned above there is only one occurrence of it in the GTF file.

If the joint pair should add up, to the values in rpvg.txt, then I think I see whats going on here:

HORVU.AKASHINRIKI.CNS.1HG00002340.1_R1  HORVU.AKASHINRIKI.CNS.1HG00002340.1_R1  14798   0.25    17.624975       1.018912        17.624975       1.018912
HORVU.AKASHINRIKI.CNS.1HG00002340.1_R1  .       14798   0.5     35.2499 2.0378212       0       0

This is the exact output from the rpvg_joint.txt.

TPM_1, TPM_2 with probability 0.25 and the TPM value assigned to the next line, all add up to 4.

I don't understand why the joint file has 2 lines for the same transcript though. Does that mean that transcript HORVU.AKASHINRIKI.CNS.1HG00002340.1 0.25 probability of being hetergenous alelle and 0.5 probability of being homogenous allele?

Perhaps I should filter the probabilities, and look for anything 0.5 and above?

brettChapman commented 1 year ago

Since my TPM values are exactly the same in TPM_1 and TPM_2 in the joint file, perhaps I should only be using rpvg.txt output file? I find the number of transcripts in rpvg.txt is roughly doubly that of rpvg_joint.txt.