SRA is hierarchically structured, with experiments (SRX) representing an individual library and runs (SRR) representing technical replicates of running a library on multiple lanes. It makes sense to collapse SRRs to the SRX level, but there is concern that SRRs within an SRX are sometimes from separate libraries. To detect these cases, it makes sense to look at the correlation of SRRs within each SRX and determine whether they look like they are from the same library.
There are three classes of SRXs: (1) singletons, which have only one SRR (n=13,456), (2) doubletons, which have two SRRs (n=2,056), and (3) multi, which have 3 or more SRRs per SRX (n=6,720). I also identified 170 runs that were missing their counts files. I need to go back and re-run these samples, but wanted to move on for now.
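The three classes above can be sketched as a small helper. This is only an illustration: the function name `classify_srx`, the mapping layout, and the toy IDs are all assumptions, not the code I actually ran.

```python
from collections import Counter


def classify_srx(srx_to_srrs):
    """Label each SRX by how many SRRs it contains (hypothetical helper)."""
    labels = {}
    for srx, srrs in srx_to_srrs.items():
        n = len(srrs)
        if n == 0:
            raise ValueError(f"{srx} has no SRRs")
        if n == 1:
            labels[srx] = "singleton"
        elif n == 2:
            labels[srx] = "doubleton"
        else:
            labels[srx] = "multi"
    return labels


# Toy mapping with made-up IDs, not real SRA accessions.
example = {
    "SRX_A": ["SRR_1"],
    "SRX_B": ["SRR_2", "SRR_3"],
    "SRX_C": ["SRR_4", "SRR_5", "SRR_6"],
}
labels = classify_srx(example)
class_counts = Counter(labels.values())
```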
I have thought a lot about how to approach this problem, and I think the method I came up with is reasonable. I used a bootstrap approach to approximate a correlation cutoff. Briefly, using samples from the doubletons, I bootstrapped 1,000 experiments, each with ~1,000 random pairs of samples. For each experiment I calculated the correlation within each random pair and found the 95% correlation cutoff (i.e., the correlation value that 95% of random pairs fell below). I took the median across these 1,000 simulations and set a correlation cutoff of ~0.97. I think this number is conservative, because my original sample distribution was small enough that there are probably some real sample pairs in the simulations.
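A minimal sketch of the bootstrap described above, under several assumptions: it uses Pearson correlation (the notes do not say which correlation was used), it draws pairs at random without checking whether they share an SRX, and the names `pearson`, `percentile`, `bootstrap_cutoff`, and the toy counts are hypothetical. The toy run uses far fewer replicates than the 1,000 x ~1,000 used in the analysis.

```python
import random
import statistics


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def percentile(values, q):
    """Percentile with linear interpolation, q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def bootstrap_cutoff(counts, n_boot=1000, n_pairs=1000, q=95, seed=0):
    """Median over replicates of the q-th percentile of random-pair correlations.

    counts: dict mapping SRR id -> list of per-gene counts (assumed layout).
    """
    rng = random.Random(seed)
    srrs = sorted(counts)
    cutoffs = []
    for _ in range(n_boot):
        # One simulated experiment: correlations of randomly paired SRRs.
        corrs = []
        for _ in range(n_pairs):
            a, b = rng.sample(srrs, 2)
            corrs.append(pearson(counts[a], counts[b]))
        cutoffs.append(percentile(corrs, q))
    return statistics.median(cutoffs)


# Toy data: 20 made-up SRRs with 50 random "gene" counts each.
toy_rng = random.Random(1)
toy_counts = {
    f"SRR_{i}": [toy_rng.randint(0, 100) for _ in range(50)] for i in range(20)
}
cutoff = bootstrap_cutoff(toy_counts, n_boot=20, n_pairs=30)
```

On unrelated random counts like these the cutoff lands far below 0.97; the high value in the real data reflects how similar expression profiles are even between unrelated libraries.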
For doubletons, I looked for sample pairs with correlations below the cutoff, and for those pairs I kept only the SRR with the largest library size. For multi, I compared each SRR (within an SRX) to the median of the group and dropped the SRR if its correlation was below the cutoff. This could remove too many SRRs if one SRR is very different and drags the median away from the other samples. An iterative approach would probably be better, but this still seems reasonable. Initially for the multi I tried to use Mahalanobis distance instead, but could never get it to behave correctly.
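The two filtering rules can be sketched as follows. This is an illustration rather than the code I ran: it assumes Pearson correlation, takes library size to be the sum of counts, and builds the group median gene by gene. The function names and toy counts are hypothetical.

```python
import statistics


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def keep_from_doubleton(counts, cutoff=0.97):
    """counts: dict with exactly two entries, SRR id -> per-gene counts."""
    (a, xa), (b, xb) = counts.items()
    if pearson(xa, xb) >= cutoff:
        return [a, b]
    # Poorly correlated pair: keep only the larger library.
    return [a] if sum(xa) >= sum(xb) else [b]


def keep_from_multi(counts, cutoff=0.97):
    """Keep SRRs whose correlation with the per-gene median profile passes."""
    median_profile = [statistics.median(col) for col in zip(*counts.values())]
    return [
        srr for srr, x in counts.items() if pearson(x, median_profile) >= cutoff
    ]


# Toy counts with made-up IDs.
doubleton = {"SRR_1": [10, 20, 30, 40], "SRR_2": [40, 10, 35, 5]}
multi = {
    "SRR_3": [10, 20, 30, 40],
    "SRR_4": [11, 19, 31, 41],
    "SRR_5": [40, 10, 35, 5],
}
kept_doubleton = keep_from_doubleton(doubleton)
kept_multi = keep_from_multi(multi)
```

In the toy multi group the outlier `SRR_5` is dropped and the other two are kept. With more outliers, or a smaller group, the median itself shifts, which is the weakness noted above.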
In general, most SRRs appeared to be highly correlated. I ended up flagging only 1,511 SRRs from 606 SRXs as having low correlations.
Output
I have a file of flags located at ../../output/correlation_downstream_analysis.pkl
flag_missing_counts: True if the counts file was missing.
flag_singleton: True if the SRX only had 1 SRR.
flag_doubleton: True if the SRX only had 2 SRRs.
flag_multi: True if the SRX had 3 or more SRRs.
flag_drop_corr: True if the correlation was below the cutoff of 0.97.
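One way the flags might be combined to list SRRs that are safe to merge is sketched below. The internal layout of the pickle file is not described here, so the rows are toy records with assumed `srx` and `srr` keys and made-up IDs. Excluding runs with missing counts is also an assumption.

```python
from collections import defaultdict

# Toy flag records; the real file's structure is assumed, not known.
flags = [
    {"srx": "SRX_A", "srr": "SRR_1", "flag_missing_counts": False, "flag_drop_corr": False},
    {"srx": "SRX_B", "srr": "SRR_2", "flag_missing_counts": False, "flag_drop_corr": False},
    {"srx": "SRX_B", "srr": "SRR_3", "flag_missing_counts": False, "flag_drop_corr": True},
    {"srx": "SRX_C", "srr": "SRR_4", "flag_missing_counts": True, "flag_drop_corr": False},
]


def safe_to_merge(rows):
    """Group SRRs by SRX, skipping runs with missing counts or low correlation."""
    merged = defaultdict(list)
    for row in rows:
        if not row["flag_missing_counts"] and not row["flag_drop_corr"]:
            merged[row["srx"]].append(row["srr"])
    return dict(merged)


merge_list = safe_to_merge(flags)
```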
Questions and Tasks
[x] Are there any examples where SRRs are poorly correlated?
Yes, there are some with very low correlations.
srx        corr      srr_counts
SRX043517  0.147669  {'SRR103723': 29615181, 'SRR103724': 28138346}
[x] Are there any other features that would suggest why these SRRs are not related?
One common theme is low read counts. However, this is not the case for the above example.
[x] Create a list of SRRs that are safe to merge to the SRX.
Definition of done