Open brettChapman opened 1 year ago
I'm finding the memory requirements for vg autoindex pushing the system requirements a bit. Running on a 1TB system I get a signal 9 error.
Would it be feasible to run on each chromosome graph separately, mpmap the RNA-seq data to each chromosome, concatenate the GAM files across all samples and chromosomes, convert to GAMP, then run rpvg?
Here is the error I get:
[vg autoindex] Executing command: /vg/bin/vg autoindex -t 16 -w mpmap -w rpvg -p barley_splice_pangenome_index -T /data/murdoch_wcgapangenome/minigraph-cactus_splice_graph/tmp -g /data/murdoch_wcgapangenome/minigraph-cactus_splice_graph/barley-pg/barley-pg.gfa -x /data/murdoch_wcgapangenome/minigraph-cactus_splice_graph/pangenome_reference.gtf -H /data/murdoch_wcgapangenome/minigraph-cactus_splice_graph/pangenome_haplotype.gtf
[IndexRegistry]: Checking for haplotype lines in GFA.
[IndexRegistry]: Constructing a GBZ from GFA input.
[IndexRegistry]: Constructing haplotype-transcript GBWT and spliced graph from GBZ-format graph.
WARNING: Excluded 20245 transcripts with exon overlapping a haplotype break.
[IndexRegistry]: Joining transcript origin table.
[IndexRegistry]: Constructing spliced XG graph from spliced VG graph.
[IndexRegistry]: Constructing distance index for a spliced graph.
[IndexRegistry]: Pruning complex regions of spliced VG to prepare for GCSA indexing.
[IndexRegistry]: Constructing GCSA/LCP indexes.
PathGraphBuilder::write(): Size limit exceeded, construction aborted
[IndexRegistry]: Exceeded disk use limit while performing k-mer doubling steps. Rewinding to pruning step with more aggressive pruning to simplify the graph.
[IndexRegistry]: Pruning complex regions of spliced VG to prepare for GCSA indexing.
[IndexRegistry]: Constructing GCSA/LCP indexes.
error:[IndexRegistry] Child process 19852 signaled with status 9 representing signal 9
/var/spool/slurmd/job50527817/slurm_script: line 31: 52722 Killed singularity exec --bind ${PWD}:${PWD} $container vg autoindex -t 16 -w mpmap -w rpvg -p barley_splice_pangenome_index -T ${PWD}/tmp -g ${PWD}/barley-pg/barley-pg.gfa -x ${PWD}/pangenome_reference.gtf -H ${PWD}/pangenome_haplotype.gtf
For rpvg
, you would really want the GAMP to be created natively as GAMP, not as GAM. If you plan to use vg mpmap
, then GAMP is the default output, so there's no extra work. The conversion of GAM -> GAMP is lossless, but the output is only nominally "multipath", as every read alignment will consist of a single path. You can run rpvg
on GAM input, but there is a modest hit to accuracy.
Unfotunately, you cannot really map separately to chromosomes and then combine the results. Technically, it's possible. However, some parts of the mapping algorithm assume that it can simultaneously view all of the high-scoring alignments, and they won't produce accurate results if this assumption is broken. For instance, you will have completely meaningless MAPQs, because each individual chromosome's run can't tell if there's another plausible mapping on a different chromosome. rpvg
directly uses the MAPQs, so that's would probably be a significant problem.
Do you know for sure that this job was killed for memory use? That step is typically heavy on use of hard disk, but not especially RAM hungry.
Thanks @jeizenga
I'll look into the error a bit more and see if it was because of memory or disk usage. I did notice the tmp directory had very large files generated. I'm running on a cluster which is not managed by myself, so I've asked the admins to check.
In regards to running rpvg. Can I concatenate multiple GAMP files together? Would I run a different vg mpmap step for each RNA-seq sample, or should I simply concatenate all the Fastq files together from the 5 genomes prior to mapping to the 5 genome splice graph? I have RNA-seq data sequenced from each of the represented genomes in the graph.
Both GAM and GAMP files can be combined with cat
. You could also combine the input FASTQs, and the results will be valid. You will probably get marginally better results running the FASTQs separately and then combining them post hoc since that will allow vg mpmap
to estimate the fragment length distribution separately for each library. You might also be interested in annotating the reads with the --sample
or --read-group
options.
Hi @jeizenga it's been comfirmed that my last job died from OOM killer, so it was a memory limit issue. I'm now running with 2TB RAM and it seems to be getting further along, however it looks like it's going through iterative steps of pruning because the size limit keeps being exceeded. Is this size limit the index size? Can the size limit be increased? Will the pruning eventually be enough and the job continue on or could this loop of pruning never end? Thanks.
[vg autoindex] Executing command: /vg/bin/vg autoindex -t 32 -w mpmap -w rpvg -p barley_splice_pangenome_index -T /data/murdoch_wcgapangenome/minigraph-cactus_splice_graph/tmp -g /data/murdoch_wcgapangenome/minigraph-cactus_splice_graph/barley-pg/barley-pg.gfa -x /data/murdoch_wcgapangenome/minigraph-cactus_splice_graph/pangenome_reference.gtf -H /data/murdoch_wcgapangenome/minigraph-cactus_splice_graph/pangenome_haplotype.gtf
[IndexRegistry]: Checking for haplotype lines in GFA.
[IndexRegistry]: Constructing a GBZ from GFA input.
[IndexRegistry]: Constructing haplotype-transcript GBWT and spliced graph from GBZ-format graph.
WARNING: Excluded 20245 transcripts with exon overlapping a haplotype break.
[IndexRegistry]: Joining transcript origin table.
[IndexRegistry]: Constructing spliced XG graph from spliced VG graph.
[IndexRegistry]: Constructing distance index for a spliced graph.
[IndexRegistry]: Pruning complex regions of spliced VG to prepare for GCSA indexing.
[IndexRegistry]: Constructing GCSA/LCP indexes.
PathGraphBuilder::write(): Size limit exceeded, construction aborted
[IndexRegistry]: Exceeded disk use limit while performing k-mer doubling steps. Rewinding to pruning step with more aggressive pruning to simplify the graph.
[IndexRegistry]: Pruning complex regions of spliced VG to prepare for GCSA indexing.
[IndexRegistry]: Constructing GCSA/LCP indexes.
PathGraphBuilder::write(): Size limit exceeded, construction aborted
[IndexRegistry]: Exceeded disk use limit while performing k-mer doubling steps. Rewinding to pruning step with more aggressive pruning to simplify the graph.
[IndexRegistry]: Pruning complex regions of spliced VG to prepare for GCSA indexing.
[IndexRegistry]: Constructing GCSA/LCP indexes.
Hi @jeizenga
It looks like the job finished. I received no more errors.
I have these output files:
barley_splice_pangenome_index.haplotx.gbwt
barley_splice_pangenome_index.txorigin.tsv
barley_splice_pangenome_index.spliced.xg
barley_splice_pangenome_index.spliced.dist
barley_splice_pangenome_index.spliced.gcsa
barley_splice_pangenome_index.spliced.gcsa.lcp
Are these all the files expected? I received no output to standard out saying the indexing had completed, apart from the pruning and constructing the GCSA/LCP as mentioned above.
If the .gcsa
and .gcsa.lcp
files aren't empty, then I believe it exited correctly. Did you give it a higher memory limit this time?
Yes, the gcsa and gcsa.lp files are large and not empty.
I used a 2TB memory node this time.
It's surprising to me that it would take so much memory in that step. I'll need to ask the author of that component why that might happen.
Ok thanks. I do have 5 genomes in the graph, and each genome is around 4 to 5Gbp in size.
Ideally we'd like to include all the genomes in our study, but given how difficult its been even with 5, I think it might be a while before pangenome indexing of that size is possible.
We've had to out source to a different compute service with enough resources to get the indexing done. Our usual available resources weren't enough.
How large did the GCSA and LCP files end up being? Their on-disk size is roughly the same as their in-memory size, so the answer has some bearing on what solutions to the memory use produce indexes that are usable in practice.
Hi @jeizenga
They're a decent size, with GCSA being 27G and LCP being 17GB:
1006M Oct 10 11:16 barley_splice_pangenome_index.haplotx.gbwt
31M Oct 10 11:16 barley_splice_pangenome_index.txorigin.tsv
17G Oct 10 11:38 barley_splice_pangenome_index.spliced.xg
11G Oct 10 12:37 barley_splice_pangenome_index.spliced.dist
27G Oct 12 18:06 barley_splice_pangenome_index.spliced.gcsa
17G Oct 12 18:06 barley_splice_pangenome_index.spliced.gcsa.lcp
Would index sizes like these be impracticable for use in practice?
I'm currently running mpmap with them and generating GAMP files. So far it's running. I'll then merge the GAMP. I have 3 different replicates for the RNA-seq reads, so I'll run rpvg separate for each replicate when looking at the quantification to get an idea on reproducibility. I was thinking of merging all replicates, but it might be better to run rpvg separate for each to see reproducibility.
Okay, that's definitely larger than usual but not deadly. Usually, the GCSA and LCP would be maybe 10-15 GB combined, but you can still use those in a server environment. I think the large indexes and high memory use probably have the same underlying cause, and they probably could both be solved by stricter pruning. I just need to figure out how to detect that behavior at run time.
Hi @jeizenga
I'm now running rpvg. All running ok so far.
Perhaps for autoindex if the user could specify a level on pruning strictness depending on their available resources? And maybe provide a report at the end indicating a recommendation to rerun again if resulting GCSA and LCP is too large, and suggest stricter pruning?
When I was running this on a more limited system I didn't even get to the pruning stage. Generating the splice graph and initial GCSA was too memory intensive to get this far. I had been running for weeks with no progress.
On this other resource we've paid for out of pocket (not government funded), we've managed to get the entire workflow done within a week. There is definitely a lot of IO overhead and in memory processing needed which pushed our other compute resources to their limit.
Hi @jeizenga
My rpvg runs have completed and I have the result files. I'm a bit confused on how to interpret.
I ran this command:
rpvg -t 32 -g ${GRAPH_XG} -p ${GRAPH_GBWT} -f ${TRANSCRIPT_ORIGIN} -a ${tissue}.gamp -o rpvg_${tissue}_results -i haplotype-transcripts
From the 3 genomes in the splice graph I have 3 RNA-seq datasets in triplicate I mapped using mpmap. I merged all the GAMP files together for each of the genomes and then ran rpvg. I have a result for each of the 3 replicates.
I get output like this:
Name ClusterID Length EffectiveLength HaplotypeProbability ReadCount TPM
HORVU.AKASHINRIKI.CNS.4HG00302460.1_R1 1 994 739.03959 1 476508.29 4545.7705
HORVU.FT11.CNS.4HG00305750.1_R1 1 1000 745.03959 0 0 0
HORVU.HOR_13942.CNS.4HG00297900.1_R1 1 1043 788.03959 0 0 0
HORVU.MOREX.CNS.4HG00296860.1_R1 1 1158 903.03959 1 547351.97 4273.3124
HORVU.MOREX.CNS.4HG00296870.1_R1 1 838 583.03959 1 1157.6261 13.99829
HORVU.MOREX.CNS.4HG00296880.1_R1 1 1517 1262.0396 1 29.303144 0.16369899
HORVU.RGT_PLANET.CNS.4HG00301670.1_R1 1 1110 855.03959 0 0 0
HORVU.AKASHINRIKI.CNS.3HG00251930.1_R1 689 1416 1161.0396 1 95074.352 577.32597
HORVU.FT11.CNS.3HG00255940.1_R1 689 1416 1161.0396 1 22892.234 139.00995
HORVU.HOR_13942.CNS.3HG00248820.1_R1 689 1756 1501.0396 0 0 0
HORVU.MOREX.CNS.3HG00248620.1_R1 689 1585 1330.0396 0 0 0
HORVU.RGT_PLANET.CNS.3HG00251780.1_R1 689 1700 1445.0396 0 0 0
HORVU.AKASHINRIKI.CNS.7HG00578810.1_R1 715 3753 3498.0396 0 0 0
HORVU.FT11.CNS.7HG00587830.1_R1 715 3394 3139.0396 1 20492.958 46.026994
HORVU.HOR_13942.CNS.7HG00573720.1_R1 715 3731 3476.0396 0 0 0
HORVU.MOREX.CNS.7HG00573230.1_R1 715 3415 3160.0396 1 26294.089 58.663823
HORVU.RGT_PLANET.CNS.7HG00580230.1_R1 715 2094 1839.0396 0 0 0
HORVU.RGT_PLANET.CNS.7HG00580240.1_R1 715 660 405.03959 0 0 0
HORVU.AKASHINRIKI.CNS.1HG00017190.1_R1 766 4599 4344.0396 1 14991.627 24.330988
HORVU.HOR_13942.CNS.1HG00016970.1_R1 766 4695 4440.0396 1 26728.925 42.442348
And the joint output file:
Name_1 Name_2 ClusterID HaplotypingProbability ReadCount_1 TPM_1 ReadCount_2 TPM_2
HORVU.AKASHINRIKI.CNS.4HG00302460.1_R1 . 1 1 476508.29 4545.7705 0 0
HORVU.MOREX.CNS.4HG00296860.1_R1 . 1 1 547351.97 4273.3124 0 0
HORVU.MOREX.CNS.4HG00296870.1_R1 . 1 1 1157.6261 13.99829 0 0
HORVU.MOREX.CNS.4HG00296880.1_R1 . 1 1 29.303144 0.16369899 0 0
HORVU.AKASHINRIKI.CNS.3HG00251930.1_R1 . 689 1 95074.352 577.32597 0 0
HORVU.FT11.CNS.3HG00255940.1_R1 . 689 1 22892.234 139.00995 0 0
HORVU.FT11.CNS.7HG00587830.1_R1 . 715 1 20492.958 46.026994 0 0
HORVU.MOREX.CNS.7HG00573230.1_R1 . 715 1 26294.089 58.663823 0 0
HORVU.AKASHINRIKI.CNS.1HG00017190.1_R1 . 766 1 14991.627 24.330988 0 0
HORVU.HOR_13942.CNS.1HG00016970.1_R1 . 766 1 26728.925 42.442348 0 0
HORVU.FT11.CNS.5HG00434380.1_R1 . 822 1 26402.199 114.12488 0 0
HORVU.RGT_PLANET.CNS.5HG00428540.1_R1 . 822 1 6519.3843 25.548806 0 0
HORVU.AKASHINRIKI.CNS.4HG00352720.1_R1 . 862 1 35874.767 294.0867 0 0
HORVU.MOREX.CNS.4HG00347720.1_R1 . 862 1 14519.975 121.71783 0 0
HORVU.AKASHINRIKI.CNS.3HG00228650.1_R1 . 895 1 11778.602 21.87023 0 0
HORVU.HOR_13942.CNS.3HG00225550.1_R1 . 895 1 19248.84 38.696229 0 0
HORVU.HOR_13942.CNS.2HG00164950.1_R1 . 944 1 3036.5661 12.927566 0 0
HORVU.MOREX.CNS.2HG00164450.1_R1 . 944 1 49.950499 0.21407612 0 0
HORVU.MOREX.CNS.2HG00164460.1_R1 . 944 1 15991.333 75.766101 0 0
HORVU.HOR_13942.CNS.2HG00104200.1_R1 . 992 1 8834.0265 43.130485 0 0
HORVU.RGT_PLANET.CNS.2HG00105220.1_R1 . 992 1 19046.574 78.849125 0 0
HORVU.AKASHINRIKI.CNS.3HG00225490.1_R1 . 1039 1 15592.037 182.89614 0 0
HORVU.HOR_13942.CNS.3HG00222080.1_R1 . 1039 1 3549.8193 39.534853 0 0
HORVU.AKASHINRIKI.CNS.6HG00496840.1_R1 . 1075 1 1087.0563 9.5318971 0 0
HORVU.AKASHINRIKI.CNS.6HG00496850.1_R1 . 1075 1 11333.622 16.191342 0 0
What do the '_1' and '_2' represent in the joint file, two different calculations?
Is every transcript specific and unique to a single haplotype? I'm trying to make sense on how to interpret the results. Thanks.
I was thinking, should I do the quantification using rpvg on each aligned RNA-seq sample, or after I've merged the 5 different RNA-seg alignment GAMP files?
We have TPM values for conventional STAR and SALMON alignment and quantification to each individual genome, and I'm trying to compare the TPM values between convention approach and the haplotype-aware approach. Can they be compared?
Is there any advantage to using the haplotype-aware splice graph compared to mapping RNA-seq reads derived from the same species/strain to the same species/strain? We have 5 different genomes, with 5 different RNA-seq datasets derived from each of the 5 different genomes. You would expect that say Morex reads would align best to Morex genome in this example.
I have found that by merging the GAMP files and doing rpvg quantification, I can identify some uniquely expressed genes from the 5 genomes from that tissue. To do this I parse the results for each clusterID, look for any non zero TPM values and count the number of transcripts in that cluster.
Sorry, it looks like I missed your message from a few days ago.
What do the '_1' and '_2' represent in the joint file, two different calculations?
I believe these refer to the two haplotypes of this transcript. If the second transcript is null, then presumably rpvg
is estimating the transcript to be homozygous.
Is every transcript specific and unique to a single haplotype?
More or less, yes. vg rna
and rpvg
consider haplotypes that only differ by intronic variants to be the same (because these differences don't show up in RNA-seq).
We have TPM values for conventional STAR and SALMON alignment and quantification to each individual genome, and I'm trying to compare the TPM values between convention approach and the haplotype-aware approach. Can they be compared?
To directly compare them, you would have to sum over the haplotypes to derive gene-level expression values.
Is there any advantage to using the haplotype-aware splice graph compared to mapping RNA-seq reads derived from the same species/strain to the same species/strain?
As long as you have annotations for all of them, the best reference for a sample is that sample's own genome. If you have a diploid species, then "own genome" is already plural, but a pangenome of its own two haplotypes would then be best. If you have a collapsed, haploid assembly for a diploid species, you could see some benefit from adding other assemblies, which might contain some of the collapsed variation. In general, pangenomes are most valuable when you know a lot about a population but not very much about the sample.
I have found that by merging the GAMP files and doing rpvg quantification, I can identify some uniquely expressed genes from the 5 genomes from that tissue.
Cool!
Hi @jeizenga
Thanks for the explanation.
Is there a way to get the mapping rates from mpmap? I'd like to identify what percentage of reads that were unmapped to compare to mapping rates of conventional approaches. Also can we identify the number of multi mapping reads and single mapping reads? Thanks.
I notice some TPM values to be quite low, and the corresponding read count to be very low. How are the read counts determined, and ideally should I apply a minimum cut off for read count before considering a transcript as being expressed?
I've noticed the TPM values in the joint file TPM_1 and TPM_2 where they aren't zero values, they're actually the exact same TPM and read count, which means I'm getting the exact TPM for each suspected heterozygous allele. Is this expected? My assemblies are collapsed primary assemblies, and are not haplotype resolved assemblies. I assume the RNA-seq picks up on the difference at the transcript level, but because the genome isn't haplotype resolved I'm getting the exact same expression values.
I notice some TPM values to be quite low, and the corresponding read count to be very low. How are the read counts determined, and ideally should I apply a minimum cut off for read count before considering a transcript as being expressed?
Generally speaking, TPM are more useful values than read counts. The read counts are also affected by the length of the transcript and the fragment length distribution. TPM are comparable across transcripts of different lengths and across different sequencing runs. Regarding the minimum, there already is one being applied. rpvg
follows a convention in this field that TPM < 10^-8 is equivalent to not being expressed.
I've noticed the TPM values in the joint file TPM_1 and TPM_2 where they aren't zero values, they're actually the exact same TPM and read count, which means I'm getting the exact TPM for each suspected heterozygous allele. Is this expected?
My guess is that if the values are exactly the same, then either a) there is no variant distinguishing them or b) there were no reads containing the variant distinguishing them. I actually don't know off the top of my head how vg rna
decides whether to merge or not merge two transcripts that are provided as reference annotations (i.e. not projected onto haplotypes). Maybe @jonassibbesen could weigh in on this point?
Hi @jeizenga thanks for the explanation.
I also notice Name_1 and Name_2 are exactly the same transcript. I thought perhaps the reads were picking up on differences for these transcripts, and identifying that these transcripts may contain heterozygous alleles, even though the TPM values are also exactly the same.
Can you share an example or two?
Here is an example of the joint results. You can see some values for TMP_2 are zero, and others have values, but _1 and _2 exactly the same. I wrote an awk script to check if they are exactly the same in every case, and they are.
HORVU.AKASHINRIKI.CNS.3HG00267660.1_R1 . 12895 0.01810418 11.728003 0.059915234 0 0
HORVU.HOR_13942.CNS.3HG00263530.1_R1 . 12895 0.0031012251 2.0084581 0.010231022 0 0
HORVU.MOREX.CNS.3HG00263830.1_R1 . 12895 0.97879459 634.45028 3.2625137 0 0
HORVU.RGT_PLANET.CNS.3HG00267280.1_R1 . 12895 1 322.67459 1.6448826 0 0
HORVU.MOREX.CNS.5HG00446250.1_R1 . 13039 1 298.05807 1.666391 0 0
HORVU.RGT_PLANET.CNS.5HG00450430.1_R1 . 13039 1 461.93613 2.482217 0 0
HORVU.FT11.CNS.5HG00388160.1_R1 . 13288 1 324.26428 1.1108365 0 0
HORVU.HOR_13942.CNS.5HG00378360.1_R1 . 13288 1 1040.7175 9.8482318 0 0
HORVU.FT11.CNS.6HG00495640.1_R1 . 13615 1 269.30481 1.9943154 0 0
HORVU.HOR_13942.CNS.6HG00485000.1_R1 . 13615 0.9638288 277.26906 1.599635 0 0
HORVU.MOREX.CNS.6HG00482800.1_R1 . 13615 0.036171202 10.401117 0.059279037 0 0
HORVU.AKASHINRIKI.CNS.4HG00303720.1_R1 . 13908 1 301.12783 0.63984396 0 0
HORVU.FT11.CNS.4HG00306910.1_R1 . 13908 1 347.86597 0.75999791 0 0
HORVU.MOREX.CNS.6HG00557450.1_R1 HORVU.MOREX.CNS.6HG00557450.1_R1 14442 1 96.121123 0.41472576 96.121123 0.41472576
HORVU.MOREX.CNS.6HG00557440.1_R1 HORVU.MOREX.CNS.6HG00557440.1_R1 14442 1 180.87518 0.67009425 180.87518 0.67009425
HORVU.FT11.CNS.5HG00449630.1_R1 . 14845 1 239.69594 0.9552738 0 0
HORVU.RGT_PLANET.CNS.5HG00443360.1_R1 . 14845 1 295.05617 0.95246482 0 0
HORVU.MOREX.CNS.2HG00147250.1_R1 HORVU.MOREX.CNS.2HG00147250.1_R1 15074 1 43.333178 0.12195811 43.333178 0.12195811
HORVU.MOREX.CNS.2HG00147240.1_R1 HORVU.MOREX.CNS.2HG00147240.1_R1 15074 1 283.16 1.5547417 283.16 1.5547417
HORVU.FT11.CNS.3HG00279880.1_R1 . 15254 1 199.64723 1.0326653 0 0
HORVU.RGT_PLANET.CNS.3HG00274730.1_R1 . 15254 1 262.34516 1.7531095 0 0
HORVU.FT11.CNS.6HG00491940.1_R1 . 15456 1 174.42454 0.70835785 0 0
HORVU.MOREX.CNS.6HG00479350.1_R1 . 15456 1 217.56085 1.0064432 0 0
HORVU.AKASHINRIKI.CNS.2HG00181600.1_R1 HORVU.AKASHINRIKI.CNS.2HG00181600.1_R1 15760 1 1005.8775 10.044326 1005.8775 10.044326
I've had a look at differences between the joint file and the other output file. There is some overlap with transcripts and TPM values. Is it safe to merge these files together? or should they be handled completely separate?
Below I have merged the columns. Some transcripts are repeated withe exact TPM values, and other have slightly different TPM values, I assume because these differences are attributed to the transcript having an alternative allele and is heterogenous. For example HORVU.AKASHINRIKI.CNS.1HG00000760.1 has a different TPM in joint vs other output file.
HORVU.AKASHINRIKI.CNS.1HG00000060.1_R1 9949 0 0 0
HORVU.AKASHINRIKI.CNS.1HG00000070.1_R1 3240 0 0 0
HORVU.AKASHINRIKI.CNS.1HG00000080.1_R1 3856 0 0 0
HORVU.AKASHINRIKI.CNS.1HG00000090.1_R1 3337 0 0 0
HORVU.AKASHINRIKI.CNS.1HG00000100.1_R1 3577 0 0 0
HORVU.AKASHINRIKI.CNS.1HG00000110.1_R1 1259 0 0 0
HORVU.AKASHINRIKI.CNS.1HG00000120.1_R1 6241 0 0 0
HORVU.AKASHINRIKI.CNS.1HG00000130.1_R1 29621 0.028664199 0 0
HORVU.AKASHINRIKI.CNS.1HG00000130.1_R1 29621 0.028664199 0 0
HORVU.AKASHINRIKI.CNS.1HG00000140.1_R1 1114 0 0 0
HORVU.AKASHINRIKI.CNS.1HG00000150.1_R1 201483 0 0 0
HORVU.AKASHINRIKI.CNS.1HG00000160.1_R1 491 0 0 0
HORVU.AKASHINRIKI.CNS.1HG00000170.1_R1 6110 0 0 0
HORVU.AKASHINRIKI.CNS.1HG00000180.1_R1 201482 0 0 0
HORVU.AKASHINRIKI.CNS.1HG00000190.1_R1 201481 0 0 0
HORVU.AKASHINRIKI.CNS.1HG00000680.1_R1 6673 0 0 0
HORVU.AKASHINRIKI.CNS.1HG00000690.1_R1 1707 0 0 0
HORVU.AKASHINRIKI.CNS.1HG00000700.1_R1 9832 0 0 0
HORVU.AKASHINRIKI.CNS.1HG00000710.1_R1 201480 0 0 0
HORVU.AKASHINRIKI.CNS.1HG00000720.1_R1 201479 0 0 0
HORVU.AKASHINRIKI.CNS.1HG00000730.1_R1 201478 0 0 0
HORVU.AKASHINRIKI.CNS.1HG00000740.1_R1 201477 0 0 0
HORVU.AKASHINRIKI.CNS.1HG00000750.1_R1 201476 0 0 0
HORVU.AKASHINRIKI.CNS.1HG00000760.1_R1 20471 1 29 0.46102147
HORVU.AKASHINRIKI.CNS.1HG00000760.1_R1 20471 1 58 0.92204293
HORVU.AKASHINRIKI.CNS.1HG00000770.1_R1 2809 0 0 0
HORVU.AKASHINRIKI.CNS.1HG00000780.1_R1 2809 0 0 0
HORVU.AKASHINRIKI.CNS.1HG00000790.1_R1 1380 0 0 0
HORVU.AKASHINRIKI.CNS.1HG00000800.1_R1 186 0 0 0
HORVU.AKASHINRIKI.CNS.1HG00000810.1_R1 201475 0 0 0
HORVU.AKASHINRIKI.CNS.1HG00000820.1_R1 1918 0 0 0
HORVU.AKASHINRIKI.CNS.1HG00000830.1_R1 201474 0 0 0
HORVU.AKASHINRIKI.CNS.1HG00000840.1_R1 4514 0 0 0
HORVU.AKASHINRIKI.CNS.1HG00000850.1_R1 19547 0 0 0
HORVU.AKASHINRIKI.CNS.1HG00000860.1_R1 201473 0 0 0
HORVU.AKASHINRIKI.CNS.1HG00000870.1_R1 28384 0.37608521 0.94450749 0.021338014
HORVU.AKASHINRIKI.CNS.1HG00000870.1_R1 28384 0.37608521 0.94450749 0.021338014
Another example
From rpvg_joint.txt file:
HORVU.AKASHINRIKI.CNS.1HG00002340.1_R1 14798 0.25 17.624975 1.018912
HORVU.AKASHINRIKI.CNS.1HG00002340.1_R1 14798 0.5 35.2499 2.0378212
From rpvg.txt file:
HORVU.AKASHINRIKI.CNS.1HG00002340.1_R1 14798 0.75 70.49985 4.0756453
I notice here the joint file has 2 entries for same transcript first line is from where it has TPM values for TPM_1 and TPM_2 exactly the same, the second line only TPM_1 is filled, zero for TPM_2. Also two different haplotype probabilities for same transcript.
In the rpvg.txt file we have same transcript, its the sum of the haplotype probabilities in the joint file, but the TPM values dont sum up to match the joint file. Based on this, perhaps its best to treat rpvg.txt and rpvg_joint. results separately? Is one more accurate than the other? How should I make use of each result?
I believe the rpvg.txt
is what you get if you sum up the TPM values for that transcript across every joint pair that it's included in from rpvg_joint.txt
, so rpvg.txt
is informational redundant with rpvg_joint.txt
but not the other way around. It seems strange to me that you are seeing multiple reference transcripts with the same ID. Are there multiple copies of it in your input annotations?
To be sure, I checked the input haplotype GTF file and for transcript HORVU.AKASHINRIKI.CNS.1HG00002340.1 mentioned above there is only one occurrence of it in the GTF file.
If the joint pair should add up, to the values in rpvg.txt, then I think I see whats going on here:
HORVU.AKASHINRIKI.CNS.1HG00002340.1_R1 HORVU.AKASHINRIKI.CNS.1HG00002340.1_R1 14798 0.25 17.624975 1.018912 17.624975 1.018912
HORVU.AKASHINRIKI.CNS.1HG00002340.1_R1 . 14798 0.5 35.2499 2.0378212 0 0
This is the exact output from the rpvg_joint.txt.
TPM_1, TPM_2 with probability 0.25 and the TPM value assigned to the next line, all add up to 4.
I don't understand why the joint file has 2 lines for the same transcript though. Does that mean that transcript HORVU.AKASHINRIKI.CNS.1HG00002340.1 0.25 probability of being hetergenous alelle and 0.5 probability of being homogenous allele?
Perhaps I should filter the probabilities, and look for anything 0.5 and above?
Since my TPM values are exactly the same in TPM_1 and TPM_2 in the joint file, perhaps I should only be using rpvg.txt output file? I find the number of transcripts in rpvg.txt is roughly doubly that of rpvg_joint.txt.
Hi
I have a pangenome graph with 5 genomes generated from minigraph-cactus.
I'm currently indexing with autoindex to use mpmap and rpvg.
When aligning RNA-seq reads should I align each set of reads from each sample, create multiple GAM files, then concatenate the GAM files, then convert to GAMP according to this wiki: https://github.com/vgteam/vg/wiki/Multipath-alignments-and-vg-mpmap
And then run rpvg? Thanks.