Help understanding how to process my data

SebastienNin commented 3 years ago

Hi, I would like to use MPRAflow to process newly sequenced MPRA data, but I'm not sure of how to do it properly.

The design of the data is: Cond1_rep1, Cond1_rep2, Cond2_rep1, Cond2_rep2.

For each sample, I have the DNA and RNA sequenced, and for each cond + rep I have read1, read2 and index fastq files (Illumina indexes, no UMI). In total, I have 8 groups of fastq files.

I ran the association workflow on each sample independently, but I'm blocked at the count workflow because it needs a pickle file. I have one pickle file per cond + rep + DNA/RNA, so 8 pickle files. I think I missed something in the association workflow. Should I merge all fastq files before running the association workflow so that I get one pickle file? Or did I do it right, and should I run the count workflow on each sample (even if that sounds strange to me)?

Also, my association workflow output seems to be empty, but I think that is because of the number of different indexes in the index fastq file (8 different indexes).

Can you give me some help please? I can send you more info if needed.

Regards, Sebastien

makirc commented 3 years ago

Hi Sebastien,

There seems to be some confusion about these two different parts. The association workflow creates a pickle file with the association of tags/barcodes to variants/inserts. This is separate from the tag/barcode counts that you obtained and for which you have RNA and DNA fastq files.

You typically do the association first, as you will need the pickle file for the count workflow. The association can derive from your experimental design (i.e. you have synthesized your variants/inserts in combination with the barcodes), or it has to be derived from separate sequencing of the variants/inserts together with the barcodes. The workflow is meant for the sequencing-derived case.
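To illustrate the idea, here is a minimal Python sketch of what a barcode-to-insert association looks like conceptually. The dictionary layout, barcode sequences, insert names and file name are all made up for illustration; this is an assumption and not MPRAflow's actual pickle format.

```python
import os
import pickle
import tempfile

# Hypothetical association: each tag/barcode maps to the variant/insert it was
# paired with. Real entries would come from the design or from separate
# sequencing of the insert together with the barcode.
association = {
    "ACGTACGTACGTACG": "insert_A",
    "TTGCATTGCATTGCA": "insert_A",
    "GGATCCGGATCCGGA": "insert_B",
}

# Written once, after the association step (hypothetical file name).
path = os.path.join(tempfile.mkdtemp(), "association.pickle")
with open(path, "wb") as handle:
    pickle.dump(association, handle)

# Loaded later by a counting step, which then only needs barcode reads.
with open(path, "rb") as handle:
    loaded = pickle.load(handle)


def insert_for(barcode, table):
    """Return the insert assigned to a barcode, or None if it is unknown."""
    return table.get(barcode)
```

Seen this way, the association describes the library rather than any single sample, so one table would normally be shared by all conditions and replicates.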

I hope this helps.

Best,

Martin

SebastienNin commented 3 years ago

Hi Martin,

Can we clarify the terms used? What do you call tags/barcodes? In my case, I have an Illumina barcode of length 6 in the I1 fastq file obtained from the sequencer. What do you call variants/inserts?

The insert I want to test (if I understood correctly, it is called the CRS in the tutorials) is a mix of synthesized oligos of length 150. Each oligo has an SNP at a fixed position in the reads (50, 75, 100 from the 5' end). We sequenced it in paired-end 150 because we expect a portion of the 5' adapter to be sequenced. I trimmed the adapter sequence from my reads and ran the association workflow for each paired-end fastq file.

Did we miss something in the experimental design?

Best, Sebastien

makirc commented 3 years ago

Hi Sebastien,

Yes, sorry. Terminology is tricky and everyone seems to have some preferred terms. Variants/inserts = CRS. Tags = barcodes; these are what you read out as a proxy for the abundance of each CRS in either RNA or DNA. Index reads are the technical reads that you do on your Illumina instrument in addition to the forward/reverse (paired-end) reads.

A typical MPRA design might start with a CRS on the same molecule as the tag/barcode, either due to oligo synthesis or as the result of an amplification step with primers synthesized with ambiguous positions. In MPRAs, you then insert the actual reporter gene between the CRS and the tag/barcode. In the following experiments you will only read the tag/barcode from your pool of cells. If you don't have a tag/barcode, but instead integrate your CRS in the 3' UTR and sequence the CRS from RNA and DNA in your pool of cells, it is called STARR-seq, which is different from what this pipeline supports.
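As a rough sketch of the read-out described above: barcodes are counted in the DNA and RNA libraries, summed per CRS through the association, and then compared. The barcodes, counts, CRS names and the plain log2 ratio below are illustrative assumptions, not the normalization that MPRAflow applies.

```python
import math
from collections import Counter

# Hypothetical barcode -> CRS association, as an association step might give.
association = {"BC1": "crs_A", "BC2": "crs_A", "BC3": "crs_B"}

# Hypothetical barcode reads observed in one DNA and one RNA library.
dna_reads = ["BC1", "BC1", "BC2", "BC2", "BC3", "BC3", "BC3", "BC3"]
rna_reads = ["BC1"] * 4 + ["BC2"] * 4 + ["BC3"] * 2


def counts_per_crs(reads, table):
    """Sum barcode observations per CRS, ignoring unassigned barcodes."""
    totals = Counter()
    for barcode in reads:
        crs = table.get(barcode)
        if crs is not None:
            totals[crs] += 1
    return totals


dna = counts_per_crs(dna_reads, association)
rna = counts_per_crs(rna_reads, association)

# Simple activity proxy: log2 of RNA over DNA counts for each CRS seen in both.
activity = {crs: math.log2(rna[crs] / dna[crs]) for crs in dna if rna[crs] > 0}
```

Note that only barcodes are read from the cells here; the CRS itself is never sequenced at this stage, which is the point that separates this design from STARR-seq.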

So what do you have in your hands? An MPRA/CRE-seq experiment or STARR-seq?

Best,

Martin

SebastienNin commented 3 years ago

Hi Martin,

Thank you for your answer.

From what you say, I think I have STARR-seq... I'll talk to my collaborator to understand why they called their experiment MPRA and to check whether I have all the information.

Thanks again.

Best

Sebastien

makirc commented 3 years ago

Ok, thanks for checking. I am closing this one out for now. Please feel free to contact me by reopening or reaching out by email.

shendurelab / MPRAflow

Help understanding how to process my data #51