Closed ifreeman6 closed 2 years ago
Dear @ifreeman6 ,
Thank you very much for reaching out. First off, please use the latest version of SplAdder if possible, as some of the data elements have changed over time.
However, the output fields you are interested in still exist. I am currently in the process of extending the documentation in this direction.
The `num_verified` field is a 2-dimensional count matrix (V x E) containing the number of samples where validation criterion V was met for event E. The validation criteria are different per event type. I have listed them in a previous issue (https://github.com/ratschlab/spladder/issues/140). The description has already been added to the docs and will be available with the next version.
The fields `iso1` and `iso2` are 2-dimensional matrices (S x E) containing the number of spliced reads supporting isoform 1 and isoform 2, respectively. The percent-spliced-in value (PSI) is computed as `iso1 / (iso1 + iso2)`. If `iso1 + iso2` is less than a threshold (10 by default), no PSI is computed and NA is used instead. This is done because count ratios become very unstable for small counts. This information has also been added to the docs and will be made available with the next release.
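As an illustration of the PSI rule described above, here is a minimal NumPy sketch. It is not SplAdder's own code: the function name `compute_psi`, the parameter `min_reads`, and the toy matrices are assumptions made for this example, and NA is represented as NaN.

```python
import numpy as np

def compute_psi(iso1, iso2, min_reads=10):
    """Return iso1 / (iso1 + iso2), with NaN where iso1 + iso2 < min_reads."""
    iso1 = np.asarray(iso1, dtype=float)
    iso2 = np.asarray(iso2, dtype=float)
    total = iso1 + iso2
    # Start from an all-NaN matrix and only fill entries with enough support.
    psi = np.full(total.shape, np.nan)
    np.divide(iso1, total, out=psi, where=total >= min_reads)
    return psi

# Toy (S x E) count matrices: 2 samples, 3 events (made-up numbers).
iso1 = np.array([[8, 3, 0], [20, 5, 50]])
iso2 = np.array([[2, 3, 0], [20, 4, 50]])
psi = compute_psi(iso1, iso2)
```

Entries whose combined count falls below the threshold stay NaN, so downstream code can tell "no reliable estimate" apart from a PSI of 0.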
Best,
Andre
Thank you! It's very helpful.
Sorry, I have another question: what is the difference between `conf_idx` and `confirmed`?
I understand the meaning of the field `conf_idx`, but I am a little confused about the field `confirmed`.
Does the field `confirmed` represent the number of samples? I am not sure...
Dear @ifreeman6 ,
here is the explanation (also taken from the future docs):
`conf_idx`: 0-based index set, containing the index of the events that are confirmed in the provided samples (that have a **confirmed** value greater than 0)
`confirmed`: integer array containing for each event the minimal support of validation criteria over samples
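The relationship between the two fields can be sketched in a few lines of NumPy. This assumes that `conf_idx` is exactly the set of positions where `confirmed` is greater than 0, as the description above suggests; the example values are made up.

```python
import numpy as np

# Hypothetical per-event values of the `confirmed` array (one entry per event).
confirmed = np.array([0, 3, 0, 1, 7])

# 0-based indices of events with a confirmed value greater than 0.
conf_idx = np.where(confirmed > 0)[0]
```

With these toy values, events 1, 3, and 4 would be reported as confirmed.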
Best, Andre
Dear Professor Andre,
Thanks for your reply. It's really helpful for me.
Recently, I have been processing a dataset (https://gdc.cancer.gov/about-data/publications/PanCanAtlas-Splicing-2018) that performs the alternative splicing analysis of the TCGA and GTEx projects using SplAdder. I want to filter the `merge_graphs_spliceType_C2.confirmed.txt.gz` files, because they contain a huge number of AS events. I intend to filter them in the following two steps:
- Firstly, I filtered the `confirmed` dataset in the `merge_graphs_spliceType_C2.counts.hdf5` file by selecting the events with `confirmed > median(confirmed)`. This step is to select those events supported by multiple samples.
- Secondly, I filtered the events by PSI values. Specifically, I retain the events for which the fraction of samples with `psi = NA` is < 1/3. This step is to ensure that the event has enough reads to support it (because NA represents `iso1 + iso2 < 10`).

I don't know if this is reasonable. I hope you can give me some suggestions or correct methods. Looking forward to your reply. Thanks again.
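The second filter described above (keep events where fewer than one third of the samples have an NA PSI) could be sketched as follows. This is only an illustration under stated assumptions: the PSI matrix is taken to be oriented samples x events (S x E), NA is represented as NaN, and the helper name `events_with_enough_support` and the toy values are hypothetical.

```python
import numpy as np

def events_with_enough_support(psi, max_na_fraction=1 / 3):
    """Return indices of events whose fraction of NaN PSI values is below the cutoff."""
    psi = np.asarray(psi, dtype=float)
    # Fraction of samples (rows) without a PSI value, computed per event (column).
    na_fraction = np.isnan(psi).mean(axis=0)
    return np.where(na_fraction < max_na_fraction)[0]

# Toy PSI matrix: 4 samples x 4 events (made-up numbers).
psi = np.array([
    [0.8, np.nan, 0.1,    np.nan],
    [0.7, np.nan, 0.2,    np.nan],
    [0.9, np.nan, np.nan, 0.6],
    [0.6, 0.4,    0.3,    0.5],
])
kept = events_with_enough_support(psi)
```

Here the per-event NA fractions are 0, 0.75, 0.25, and 0.5, so only events 0 and 2 pass the cutoff.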
Dear Professor @akahles Andre, I'm anxiously awaiting your response. I am sorry for taking up your precious time. I would appreciate it if you could help me. Thank you so much!
Dear @ifreeman6 ,
Of your two filters, I only understand the second one, which filters events based on read support; I think this is a reasonable criterion. For the first one, you are basing your filter on the `confirmed` array. This is an array of binary values. Depending on the number of 1-entries, the median will either be 0 (if there are more 0s) or 1 (if there are more 1s). Your comparison then returns either the `conf_idx` array unfiltered or an empty array.
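The effect of a median threshold on a binary array can be checked with a small NumPy example. The helper `median_filter` and the two toy arrays are made up for this sketch; it only assumes, as stated above, that the values are 0 or 1.

```python
import numpy as np

def median_filter(confirmed):
    """Return indices of events with confirmed > median(confirmed)."""
    confirmed = np.asarray(confirmed)
    return np.where(confirmed > np.median(confirmed))[0]

mostly_zero = np.array([0, 0, 0, 1, 1])  # median is 0
mostly_one = np.array([0, 1, 1, 1, 1])   # median is 1

# Median 0: every event with value 1 is kept, the same as selecting confirmed > 0.
kept_when_median_zero = median_filter(mostly_zero)
# Median 1: nothing is strictly greater than 1, so the result is empty.
kept_when_median_one = median_filter(mostly_one)
```

In both cases the median adds no information beyond the `confirmed > 0` test itself.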
Please note that the issue tracker here on GitHub is for discussing bugs and problems of the SplAdder software. Unfortunately, I do not have the capacity to discuss individual analyses or data sets.
Best,
Andre
I will close this issue for now. Please re-open if further discussion on the issue is needed or open a new issue if further questions should arise.
Thanks, Andre
Description
The output file "Event Files in HDF5 Format" of the old version of SplAdder includes some datasets called `num_verified`, `iso1`, and `iso2`. What do they mean? I can't find an explanation of them in the latest and stable SplAdder documentation (https://spladder.readthedocs.io/en/latest/).
What I Did