Old version SplAdder output file --- "Event Files in HDF5 Format" --- what's the meaning of "num_verified" dataset?

ifreeman6 commented 2 years ago

spladder version: SplAdder version 1.0.0

Description

The output file "Event Files in HDF5 Format" of the old SplAdder version includes datasets called num_verified, iso1, and iso2. What do they mean? I can't find an explanation of them in the latest stable SplAdder documentation (https://spladder.readthedocs.io/en/latest/).

akahles commented 2 years ago

Dear @ifreeman6 ,

Thank you very much for reaching out. First off, please use the latest version of SplAdder if possible, as some of the data elements have changed over time.

However, the output fields you are interested in still exist. I am currently in the process of extending the documentation in this direction.

The num_verified dataset is a 2-dimensional count matrix (V x E) containing the number of samples in which validation criterion V was met for event E. The validation criteria differ per event type. I have listed them in a previous issue (https://github.com/ratschlab/spladder/issues/140). The description has already been added to the docs and will be available with the next version.

The fields iso1 and iso2 are 2-dimensional matrices (S x E) containing the number of spliced reads supporting isoform 1 and isoform 2, respectively. The percent-spliced-in value (PSI) is computed as iso1 / (iso1 + iso2). If iso1 + iso2 is less than a threshold (10 by default), no PSI is computed and NA is used instead. This is done because count ratios become very unstable for small counts. This information has also been added to the docs and will be made available with the next release.
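As a rough illustration of the PSI rule above, here is a minimal sketch in plain Python. It is not SplAdder's own code: the function name compute_psi, the min_reads parameter, and the example counts are all made up for illustration, and nested lists stand in for the (S x E) matrices that would normally be read from the HDF5 file.

```python
import math

def compute_psi(iso1, iso2, min_reads=10):
    """Return a samples x events PSI matrix; NaN marks too little read support.

    iso1, iso2: nested lists (samples x events) of spliced-read counts.
    min_reads: hypothetical name for the threshold on iso1 + iso2.
    """
    psi = []
    for row1, row2 in zip(iso1, iso2):
        psi_row = []
        for c1, c2 in zip(row1, row2):
            total = c1 + c2
            # PSI is only computed when the total support reaches the threshold.
            psi_row.append(c1 / total if total >= min_reads else math.nan)
        psi.append(psi_row)
    return psi

# Made-up counts for 2 samples and 3 events.
iso1 = [[8, 2, 0], [30, 5, 12]]
iso2 = [[2, 3, 0], [10, 4, 36]]
psi = compute_psi(iso1, iso2)
```

In this toy input, the first event of the first sample has exactly 10 supporting reads, so it gets a PSI value, while the events with 5, 0, and 9 total reads are reported as NaN.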

Best,

Andre

ifreeman6 commented 2 years ago

Thank you! That's very helpful. Sorry, I have another question: what is the difference between conf_idx and confirmed?

I understand the meaning of the conf_idx field, but I am a little confused about the confirmed field.

Does the confirmed field represent the number of samples? I am not certain...

akahles commented 2 years ago

Dear @ifreeman6 ,

Here is the explanation (also taken from the future docs):

conf_idx: 0-based index set, containing the index of the events that are confirmed in the provided samples (that have a **confirmed** value greater than 0)
confirmed: integer array containing for each event the minimal support of validation criteria over samples
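
The relationship between the two fields can be sketched as follows. This is only an illustration of the description above, not code taken from SplAdder; the helper name confirmed_indices and the example values are hypothetical.

```python
def confirmed_indices(confirmed):
    """Return the 0-based indices of events whose confirmed value is > 0."""
    return [i for i, c in enumerate(confirmed) if c > 0]

# Made-up per-event confirmed values for five events.
confirmed = [0, 3, 1, 0, 2]
conf_idx = confirmed_indices(confirmed)
```

With these values, events 1, 2, and 4 have a confirmed value greater than 0, so they are the ones that would appear in conf_idx.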

Best, Andre

ifreeman6 commented 2 years ago

Dear Professor Andre,

Thanks for your reply. It's really helpful for me.

Recently, I have been working with a dataset (https://gdc.cancer.gov/about-data/publications/PanCanAtlas-Splicing-2018) in which alternative splicing in the TCGA and GTEx projects was analyzed using SplAdder. I want to filter the merge_graphs_spliceType_C2.confirmed.txt.gz files because they contain a huge number of AS events. I intend to filter them in the following two steps:

First, I filtered on the confirmed dataset in the merge_graphs_spliceType_C2.counts.hdf5 file, keeping the events with confirmed > median(confirmed). This step is meant to select events supported by multiple samples.

Second, I filtered the events by PSI value. Specifically, I retain the events for which PSI is NA in fewer than 1/3 of all samples. This step is meant to ensure that each event has enough supporting reads (because NA represents iso1 + iso2 < 10).
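
To make the second step concrete, here is a minimal sketch of it in plain Python. The function name keep_events_by_na_fraction, the max_na_fraction parameter, and the PSI values are made up for illustration; the real PSI matrix would come from the HDF5 file.

```python
import math

def keep_events_by_na_fraction(psi, max_na_fraction=1 / 3):
    """Return indices of events whose fraction of NaN PSI values is below the cutoff.

    psi: nested list (samples x events) with math.nan where PSI is NA.
    """
    n_samples = len(psi)
    n_events = len(psi[0]) if psi else 0
    kept = []
    for e in range(n_events):
        # Count the samples in which this event has no PSI value.
        n_na = sum(1 for s in range(n_samples) if math.isnan(psi[s][e]))
        if n_na / n_samples < max_na_fraction:
            kept.append(e)
    return kept

nan = math.nan
# Made-up PSI values for 4 samples and 3 events.
psi = [
    [0.8, nan, 0.1],
    [0.7, nan, nan],
    [0.9, 0.5, 0.2],
    [0.6, nan, 0.3],
]
kept = keep_events_by_na_fraction(psi)
```

Here event 1 is NA in three of four samples and is dropped, while events 0 and 2 are NA in none and one of the four samples, respectively, and are kept.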

I am not sure whether this is reasonable. I would appreciate any suggestions or pointers to a better method. Looking forward to your reply. Thanks again.

ifreeman6 commented 2 years ago

Dear Professor @akahles, I am anxiously awaiting your response. I am sorry to take up your valuable time, and I would greatly appreciate your help. Thank you so much!

akahles commented 2 years ago

Dear @ifreeman6 ,

Of your two filters, I only understand the second one, which filters events based on read support; I think that is a reasonable criterion. For the first one, you are basing your filter on the confirmed array. This is an array of binary values. Depending on the number of 1-entries, the median will be either 0 (if there are more 0s) or 1 (if there are more 1s). Your comparison then returns either the conf_idx array unfiltered or an empty array.
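
A small sketch of this behaviour, assuming confirmed holds only 0/1 values as described in this comment. The function name filter_above_median and the example arrays are made up for illustration.

```python
from statistics import median

def filter_above_median(confirmed):
    """Return indices of events whose confirmed value exceeds the median."""
    m = median(confirmed)
    return [i for i, c in enumerate(confirmed) if c > m]

# Made-up binary confirmed arrays.
mostly_zero = [0, 0, 0, 1, 1]  # median is 0
mostly_one = [1, 1, 1, 0, 0]   # median is 1

kept_mostly_zero = filter_above_median(mostly_zero)
kept_mostly_one = filter_above_median(mostly_one)
```

In the first case the filter keeps exactly the events with a value of 1, which is the same set as confirmed > 0. In the second case nothing exceeds the median, so the result is empty.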

Please note that the issue tracker here on GitHub is for discussing bugs and problems of the SplAdder software. Unfortunately, I do not have the capacity to discuss individual analyses or data sets.

Best,

Andre

akahles commented 2 years ago

I will close this issue for now. Please re-open if further discussion on the issue is needed or open a new issue if further questions should arise.

Thanks, Andre

ratschlab / spladder

Old version SplAdder output file --- "Event Files in HDF5 Format" --- what's the meaning of "num_verified" dataset? #145
