Open zyh4482 opened 3 months ago
Dear @zyh4482 , thanks for reporting your insights. This could still work, if you split up the process. You can build a single graph per input file with out quantifying the graphs. The resulting graphs should be sufficiently small such that you can keep tens of thousands of them. Once all single graphs are generated, you can hierarchically merge them into a single merged graph. Then you can re-use this graph to quantify your samples in batches (as you don't have access to all samples at the same time). It would need a bit of custom code to take the quantification objects from the individual batches and merge them. Let me know if you consider going this route and I can provide hints or a script. Best,
Andre
@akahles It's so great to hear about that! So many thanks for your kindly advice!
Would you mind sharing some codes or hints to achieve this kind of task?
Sure - what you would need to do to combine the quantifications of a single merged graph across many bam-file batches is to join the count matrices. Let's have a look at the structure of the count file. If you have h5py
installed, you can inspect the content of the count hdf5 on the command line like so:
h5ls -r tests/testcase_events/results_merged/spladder/genes_graph_conf3.merge_graphs.count.hdf5
This is a file that is delivered with SplAdder as part of the test suite. The output should look like the following:
/ Group
/edge_idx Dataset {50}
/edges Dataset {50, 20}
/gene_ids_edges Dataset {50, 1}
/gene_ids_segs Dataset {66, 1}
/gene_names Dataset {16, 1}
/samples Dataset {20}
/seg_len Dataset {66, 1}
/seg_pos Dataset {66, 20}
/segments Dataset {66, 20}
/strains Soft Link {/samples}
This test case is run over a small cohort of 20 test samples. That is any matrix where you see a 20, you would need to concatenate across your batches along this axis. The remaining numbers should be the same for all of your quantification files, as they are determined by the underlying graph, which would be the same for your batch-wise quantification.
Sorry that this is a bit complicated, but building graphs on >100TB input data is nothing that is usually done and counts as a "special case" :)
Great! Thanks for your kindly help. I'll run some test and post my feedback later. Best regards to you
@akahles Hello!
I've made some tests and it works! Thank you very much.
But another issue occurs.
My dataset has largely overlapped with your previous work on Cancer Cell 2018. So I tried to check if the result is okay to use regarding to your benchmarks.
The GFF3 file is downloaded from GDC: merge_graphs_alt_3prime_C2.confirmed.gff3. HDF5 file is downloaded in the same way as well.
I checked alt_3prime.confirmed hdf5 but there are some confusing results.
>>> a3_tumor['event_pos'][a3_tumor['gene_idx'] == 6]
array([[861117, 861180, 861117, 861145, 861301, 861393, 861301, 861393],
[870105, 870221, 870105, 870305, 871151, 871173, 871151, 871173],
[874651, 874792, 874651, 874840, 876523, 876686, 876523, 876686],
[874654, 874792, 874654, 874840, 877515, 877553, 877515, 877553],
[874654, 874792, 874654, 874840, 877789, 877868, 877789, 877868],
[877116, 877241, 877116, 877316, 877515, 877553, 877515, 877553],
[877938, 878438, 877938, 878757, 879077, 879188, 879077, 879188],
[878632, 878833, 878632, 878757, 879077, 879188, 879077, 879188]])
Here is the event_pos regarding to gene ENSG00000187634.6. hg19 reference exons coordinates can be found here.
According to the GFF3, there are 21 confirmed alt_3prime events. For example:
##gff-version 3
1 alt_3prime gene 860260 861393 . + . ID=alt_3prime.1;GeneName="ENSG00000187
634.6"
1 alt_3prime mRNA 860260 861393 . + . ID=alt_3prime.1_iso1;Parent=alt_3prime
.1;GeneName="ENSG00000187634.6"
1 alt_3prime exon 860260 860328 . + . Parent=alt_3prime.1_iso1
1 alt_3prime exon 861302 861393 . + . Parent=alt_3prime.1_iso1
1 alt_3prime mRNA 860260 861393 . + . ID=alt_3prime.1_iso2;Parent=alt_3prime
.1;GeneName="ENSG00000187634.6"
1 alt_3prime exon 860260 860328 . + . Parent=alt_3prime.1_iso2
1 alt_3prime exon 861278 861393 . + . Parent=alt_3prime.1_iso2
1 alt_3prime gene 860530 861393 . + . ID=alt_3prime.2;GeneName="ENSG00000187
634.6"
1 alt_3prime mRNA 860530 861393 . + . ID=alt_3prime.2_iso1;Parent=alt_3prime
.2;GeneName="ENSG00000187634.6"
1 alt_3prime exon 860530 860569 . + . Parent=alt_3prime.2_iso1
1 alt_3prime exon 861302 861393 . + . Parent=alt_3prime.2_iso1
1 alt_3prime mRNA 860530 861393 . + . ID=alt_3prime.2_iso2;Parent=alt_3prime
.2;GeneName="ENSG00000187634.6"
1 alt_3prime exon 860530 860569 . + . Parent=alt_3prime.2_iso2
1 alt_3prime exon 861278 861393 . + . Parent=alt_3prime.2_iso2
1 alt_3prime gene 861118 861393 . + . ID=alt_3prime.3;GeneName="ENSG00000187
634.6"
1 alt_3prime mRNA 861118 861393 . + . ID=alt_3prime.3_iso1;Parent=alt_3prime
.3;GeneName="ENSG00000187634.6"
1 alt_3prime exon 861118 861180 . + . Parent=alt_3prime.3_iso1
1 alt_3prime exon 861302 861393 . + . Parent=alt_3prime.3_iso1
1 alt_3prime mRNA 861118 861393 . + . ID=alt_3prime.3_iso2;Parent=alt_3prime
.3;GeneName="ENSG00000187634.6"
1 alt_3prime exon 861118 861180 . + . Parent=alt_3prime.3_iso2
1 alt_3prime exon 861278 861393 . + . Parent=alt_3prime.3_iso2
1 alt_3prime gene 866419 871173 . + . ID=alt_3prime.4;GeneName="ENSG00000187
634.6"
1 alt_3prime mRNA 866419 871173 . + . ID=alt_3prime.4_iso1;Parent=alt_3prime
.4;GeneName="ENSG00000187634.6"
1 alt_3prime exon 866419 866469 . + . Parent=alt_3prime.4_iso1
1 alt_3prime exon 871152 871173 . + . Parent=alt_3prime.4_iso1
1 alt_3prime mRNA 866419 871173 . + . ID=alt_3prime.4_iso2;Parent=alt_3prime
.4;GeneName="ENSG00000187634.6"
1 alt_3prime exon 866419 866469 . + . Parent=alt_3prime.4_iso2
1 alt_3prime exon 871156 871173 . + . Parent=alt_3prime.4_iso2
However, based on the description of hdf5 event_pos
, there are only 8 events can be identified within the same gene (The first pop-up gene_idx is 6.0, therefore I used this command to extract the event_pos: a3_tumor['event_pos'][a3_tumor['gene_idx'] == 6]
).
In addition, taking alt_3prime.5 as an example: ''' 1 alt_3prime gene 866419 871276 . + . ID=alt_3prime.5;GeneName="ENSG00000187 634.6" 1 alt_3prime mRNA 866419 871276 . + . ID=alt_3prime.5_iso1;Parent=alt_3prime .5;GeneName="ENSG00000187634.6" 1 alt_3prime exon 866419 866469 . + . Parent=alt_3prime.5_iso1 1 alt_3prime exon 871185 871276 . + . Parent=alt_3prime.5_iso1 1 alt_3prime mRNA 866419 871276 . + . ID=alt_3prime.5_iso2;Parent=alt_3prime .5;GeneName="ENSG00000187634.6" 1 alt_3prime exon 866419 866469 . + . Parent=alt_3prime.5_iso2 1 alt_3prime exon 871156 871276 . + . Parent=alt_3prime.5_iso2 '''
The exon coordinates of this minigene cannot be matched to any event_pos.
May I ask:
event_coordinates
and
alt_region_coordinates
. But I don't fully understand how the coordinates are generated?
Hello!
I'm working on large cohort with thousands of samples ( > 100 TB). However, due to the limitation of storage, it is not possible to keep all aligned reads at the same time. I found no matter which mode I choose, event quantification still requires me to pass all the bams to SplAdder. Am I correct? In my case, is it impossible to use SplAdder for detecting events from thousands of bams?