Save space at merge step

zyh4482 commented 3 months ago

Hello!

I'm working on large cohort with thousands of samples ( > 100 TB). However, due to the limitation of storage, it is not possible to keep all aligned reads at the same time. I found no matter which mode I choose, event quantification still requires me to pass all the bams to SplAdder. Am I correct? In my case, is it impossible to use SplAdder for detecting events from thousands of bams?

akahles commented 1 month ago

Dear @zyh4482 , thanks for reporting your insights. This could still work, if you split up the process. You can build a single graph per input file with out quantifying the graphs. The resulting graphs should be sufficiently small such that you can keep tens of thousands of them. Once all single graphs are generated, you can hierarchically merge them into a single merged graph. Then you can re-use this graph to quantify your samples in batches (as you don't have access to all samples at the same time). It would need a bit of custom code to take the quantification objects from the individual batches and merge them. Let me know if you consider going this route and I can provide hints or a script. Best,

Andre

zyh4482 commented 1 month ago

@akahles It's so great to hear about that! So many thanks for your kindly advice!

Would you mind sharing some codes or hints to achieve this kind of task?

akahles commented 1 month ago

Sure - what you would need to do to combine the quantifications of a single merged graph across many bam-file batches is to join the count matrices. Let's have a look at the structure of the count file. If you have h5py installed, you can inspect the content of the count hdf5 on the command line like so:

h5ls -r tests/testcase_events/results_merged/spladder/genes_graph_conf3.merge_graphs.count.hdf5

This is a file that is delivered with SplAdder as part of the test suite. The output should look like the following:

/                        Group
/edge_idx                Dataset {50}
/edges                   Dataset {50, 20}
/gene_ids_edges          Dataset {50, 1}
/gene_ids_segs           Dataset {66, 1}
/gene_names              Dataset {16, 1}
/samples                 Dataset {20}
/seg_len                 Dataset {66, 1}
/seg_pos                 Dataset {66, 20}
/segments                Dataset {66, 20}
/strains                 Soft Link {/samples}

This test case is run over a small cohort of 20 test samples. That is any matrix where you see a 20, you would need to concatenate across your batches along this axis. The remaining numbers should be the same for all of your quantification files, as they are determined by the underlying graph, which would be the same for your batch-wise quantification.

akahles commented 1 month ago

Sorry that this is a bit complicated, but building graphs on >100TB input data is nothing that is usually done and counts as a "special case" :)

zyh4482 commented 1 month ago

Great! Thanks for your kindly help. I'll run some test and post my feedback later. Best regards to you

zyh4482 commented 1 month ago

@akahles Hello!

I've made some tests and it works! Thank you very much.

But another issue occurs.

My dataset has largely overlapped with your previous work on Cancer Cell 2018. So I tried to check if the result is okay to use regarding to your benchmarks.

The GFF3 file is downloaded from GDC: merge_graphs_alt_3prime_C2.confirmed.gff3. HDF5 file is downloaded in the same way as well.

I checked alt_3prime.confirmed hdf5 but there are some confusing results.

>>> a3_tumor['event_pos'][a3_tumor['gene_idx'] == 6]
array([[861117, 861180, 861117, 861145, 861301, 861393, 861301, 861393],
       [870105, 870221, 870105, 870305, 871151, 871173, 871151, 871173],
       [874651, 874792, 874651, 874840, 876523, 876686, 876523, 876686],
       [874654, 874792, 874654, 874840, 877515, 877553, 877515, 877553],
       [874654, 874792, 874654, 874840, 877789, 877868, 877789, 877868],
       [877116, 877241, 877116, 877316, 877515, 877553, 877515, 877553],
       [877938, 878438, 877938, 878757, 879077, 879188, 879077, 879188],
       [878632, 878833, 878632, 878757, 879077, 879188, 879077, 879188]])

Here is the event_pos regarding to gene ENSG00000187634.6. hg19 reference exons coordinates can be found here.

According to the GFF3, there are 21 confirmed alt_3prime events. For example:

##gff-version 3
1       alt_3prime      gene    860260  861393  .       +       .       ID=alt_3prime.1;GeneName="ENSG00000187
634.6"
1       alt_3prime      mRNA    860260  861393  .       +       .       ID=alt_3prime.1_iso1;Parent=alt_3prime
.1;GeneName="ENSG00000187634.6"
1       alt_3prime      exon    860260  860328  .       +       .       Parent=alt_3prime.1_iso1
1       alt_3prime      exon    861302  861393  .       +       .       Parent=alt_3prime.1_iso1
1       alt_3prime      mRNA    860260  861393  .       +       .       ID=alt_3prime.1_iso2;Parent=alt_3prime
.1;GeneName="ENSG00000187634.6"
1       alt_3prime      exon    860260  860328  .       +       .       Parent=alt_3prime.1_iso2
1       alt_3prime      exon    861278  861393  .       +       .       Parent=alt_3prime.1_iso2
1       alt_3prime      gene    860530  861393  .       +       .       ID=alt_3prime.2;GeneName="ENSG00000187
634.6"
1       alt_3prime      mRNA    860530  861393  .       +       .       ID=alt_3prime.2_iso1;Parent=alt_3prime
.2;GeneName="ENSG00000187634.6"
1       alt_3prime      exon    860530  860569  .       +       .       Parent=alt_3prime.2_iso1
1       alt_3prime      exon    861302  861393  .       +       .       Parent=alt_3prime.2_iso1
1       alt_3prime      mRNA    860530  861393  .       +       .       ID=alt_3prime.2_iso2;Parent=alt_3prime
.2;GeneName="ENSG00000187634.6"
1       alt_3prime      exon    860530  860569  .       +       .       Parent=alt_3prime.2_iso2
1       alt_3prime      exon    861278  861393  .       +       .       Parent=alt_3prime.2_iso2
1       alt_3prime      gene    861118  861393  .       +       .       ID=alt_3prime.3;GeneName="ENSG00000187
634.6"
1       alt_3prime      mRNA    861118  861393  .       +       .       ID=alt_3prime.3_iso1;Parent=alt_3prime
.3;GeneName="ENSG00000187634.6"
1       alt_3prime      exon    861118  861180  .       +       .       Parent=alt_3prime.3_iso1
1       alt_3prime      exon    861302  861393  .       +       .       Parent=alt_3prime.3_iso1
1       alt_3prime      mRNA    861118  861393  .       +       .       ID=alt_3prime.3_iso2;Parent=alt_3prime
.3;GeneName="ENSG00000187634.6"
1       alt_3prime      exon    861118  861180  .       +       .       Parent=alt_3prime.3_iso2
1       alt_3prime      exon    861278  861393  .       +       .       Parent=alt_3prime.3_iso2
1       alt_3prime      gene    866419  871173  .       +       .       ID=alt_3prime.4;GeneName="ENSG00000187
634.6"
1       alt_3prime      mRNA    866419  871173  .       +       .       ID=alt_3prime.4_iso1;Parent=alt_3prime
.4;GeneName="ENSG00000187634.6"
1       alt_3prime      exon    866419  866469  .       +       .       Parent=alt_3prime.4_iso1
1       alt_3prime      exon    871152  871173  .       +       .       Parent=alt_3prime.4_iso1
1       alt_3prime      mRNA    866419  871173  .       +       .       ID=alt_3prime.4_iso2;Parent=alt_3prime
.4;GeneName="ENSG00000187634.6"
1       alt_3prime      exon    866419  866469  .       +       .       Parent=alt_3prime.4_iso2
1       alt_3prime      exon    871156  871173  .       +       .       Parent=alt_3prime.4_iso2

However, based on the description of hdf5 event_pos, there are only 8 events can be identified within the same gene (The first pop-up gene_idx is 6.0, therefore I used this command to extract the event_pos: a3_tumor['event_pos'][a3_tumor['gene_idx'] == 6]).

In addition, taking alt_3prime.5 as an example: ''' 1 alt_3prime gene 866419 871276 . + . ID=alt_3prime.5;GeneName="ENSG00000187 634.6" 1 alt_3prime mRNA 866419 871276 . + . ID=alt_3prime.5_iso1;Parent=alt_3prime .5;GeneName="ENSG00000187634.6" 1 alt_3prime exon 866419 866469 . + . Parent=alt_3prime.5_iso1 1 alt_3prime exon 871185 871276 . + . Parent=alt_3prime.5_iso1 1 alt_3prime mRNA 866419 871276 . + . ID=alt_3prime.5_iso2;Parent=alt_3prime .5;GeneName="ENSG00000187634.6" 1 alt_3prime exon 866419 866469 . + . Parent=alt_3prime.5_iso2 1 alt_3prime exon 871156 871276 . + . Parent=alt_3prime.5_iso2 '''

The exon coordinates of this minigene cannot be matched to any event_pos.

May I ask:

Why there are mismatches between hdf5 event_pos and GFF3?
How to interpret event_pos in exon skip, alt_3prime, alt_5prime, mutex, IR, respectively?
I found the write.py holds a function icgc to write down the psi table, which seems to be the same output as you provided in GDC. There are two columns: event_coordinates and alt_region_coordinates. But I don't fully understand how the coordinates are generated?

ratschlab / spladder

Save space at merge step #200