shendurelab / MPRAflow

A portable, flexible, parallelized tool for complete processing of massively parallel reporter assay data
Apache License 2.0

Can MPRAflow deal with this type of MPRA data? #27

Closed Sylarair closed 3 years ago

Sylarair commented 4 years ago

Hi,

I am processing MPRA data with MPRAflow. However, I now want to process this MPRA dataset (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE87711), which lacks the insert sequences, the barcodes, and the element sequence FASTA file.

So, my question is: can MPRAflow deal with this dataset? I have tried several times but failed. Could anyone help me with this? Thanks!

makirc commented 4 years ago

Dear Sylarair,

This is a little confusing. It seems the data set referenced in this issue has changed. You previously referred to https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE142207, which is STARR-seq data. While used for similar biological questions, MPRA and STARR-seq assays are very different in the kind of data that is obtained. For STARR-seq, the analysis is more similar to a ChIP-seq approach, where you measure peak height over control. In contrast, MPRAs measure RNA/DNA ratios corresponding to pre-assigned regulatory sequence candidates. MPRAflow is therefore not appropriate for the analysis of this kind of data, so please look for a different tool. I just saw this preprint: https://www.biorxiv.org/content/10.1101/694869v3 Maybe this is a good starting point for you. I also noticed a Bioconductor package (https://bioconductor.org/packages/release/bioc/html/BasicSTARRseq.html).

You have now changed to an actual MPRA data set (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE87711). This data set references two files, GSE87711_design.txt.gz and GSE87711_design_barcodes.txt.gz, which should provide you with the relevant assignment of barcodes to candidate sequences. The whole data set might be a little tricky though, because RNA and DNA samples are not strictly matched here: you do not have a DNA sample for every RNA sample, only DNA count representations from sequencing of the plasmid library prior to the transfections. You would need to calculate RNA/DNA from these values as a proxy for what made it into the cells. Further, the number of replicates varies between the experiments. None of this is ideal. You might consider pooling the two DNA samples, as well as the RNA samples of the same condition.
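To make the pooling idea concrete, here is a minimal sketch in plain Python. It is not what MPRAflow does internally, and everything in it is an assumption for illustration: the function names `pool_counts` and `log2_rna_dna` are hypothetical, the per-barcode counts are assumed to be already available as dictionaries, and the barcode-to-element mapping is assumed to have been parsed from the design files, whose exact layout would need to be checked first.

```python
import math
from collections import defaultdict


def pool_counts(samples):
    """Sum per-barcode counts over several samples (each a dict barcode -> count)."""
    pooled = defaultdict(int)
    for sample in samples:
        for barcode, count in sample.items():
            pooled[barcode] += count
    return dict(pooled)


def log2_rna_dna(dna_samples, rna_samples, barcode_to_element, min_dna=1):
    """Hypothetical helper: log2(RNA/DNA) per element from pooled counts.

    dna_samples: plasmid-library count dicts, pooled as a proxy for
                 what was transfected (assumption from the discussion above).
    rna_samples: RNA count dicts of one condition, pooled.
    barcode_to_element: dict barcode -> candidate sequence name.
    min_dna: barcodes with fewer pooled DNA counts are ignored (must be >= 1).
    """
    dna = pool_counts(dna_samples)
    rna = pool_counts(rna_samples)

    # Aggregate barcode counts per candidate element.
    dna_per_element = defaultdict(int)
    rna_per_element = defaultdict(int)
    for barcode, element in barcode_to_element.items():
        dna_count = dna.get(barcode, 0)
        if dna_count < min_dna:
            continue
        dna_per_element[element] += dna_count
        rna_per_element[element] += rna.get(barcode, 0)

    dna_total = sum(dna_per_element.values())
    rna_total = sum(rna_per_element.values())

    ratios = {}
    for element, dna_count in dna_per_element.items():
        rna_count = rna_per_element[element]
        if rna_count == 0:
            continue  # log2 is undefined without RNA counts
        # Normalize each count by its library size before taking the ratio.
        ratios[element] = math.log2(
            (rna_count / rna_total) / (dna_count / dna_total)
        )
    return ratios
```

Calling `log2_rna_dna` once per condition, with the same pooled DNA samples each time, gives one activity estimate per candidate sequence. Pooling this way gives up the replicate structure, so it is only a rough starting point.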

Hope that helps.

Best, Martin

Sylarair commented 4 years ago

Thanks! I realized that the dataset I pasted before was STARR-seq, so I changed it to the MPRA dataset that I am focused on.

By the way, I am still confused about how to apply MPRAflow to MPRA datasets such as GSE87711 and https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE115042. If possible, could you give me some advice, such as example code?

Sylarair commented 4 years ago

The FASTQ was split into 4 files. The reads in _1.fastq.gz and _2.fastq.gz have length 100, while the reads in _3.fastq.gz and _4.fastq.gz have lengths 12 and 6.
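For reference, read lengths like these can be tallied with a short script. This is a minimal sketch, assuming standard four-line FASTQ records in a gzip-compressed file. The function name `read_length_counts` is hypothetical, and the script only reports lengths: whether the short reads are barcode or index reads would still need to be checked against the GEO description.

```python
import gzip
from collections import Counter


def read_length_counts(path, max_reads=10000):
    """Count sequence lengths in the first max_reads records of a .fastq.gz file.

    Assumes four lines per record, with the sequence on the second line.
    """
    lengths = Counter()
    with gzip.open(path, "rt") as handle:
        for line_number, line in enumerate(handle):
            if line_number // 4 >= max_reads:
                break
            if line_number % 4 == 1:  # sequence line of the record
                lengths[len(line.rstrip("\n"))] += 1
    return lengths
```

Running it on each of the four files and comparing the resulting counters shows whether every file has one consistent read length.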

[Screenshot: Screen Shot 2020-06-01 at 9 48 55 AM]

Sylarair commented 4 years ago

[Screenshots: Screen Shot 2020-06-01 at 10 01 40 AM, Screen Shot 2020-06-01 at 10 02 27 AM]

visze commented 3 years ago

Cleaning up. Please reopen if this is still an issue.