shendurelab / MPRAflow

A portable, flexible, parallelized tool for complete processing of massively parallel reporter assay data
Apache License 2.0

Not incorporating labels into MPRAnalyze inputs #80

Closed by calhoujd 9 months ago

calhoujd commented 10 months ago

Hi,

First off, I want to say thanks for MPRAflow, which has been very helpful. I have run into a potential issue, though, while trying to use the output from --mpranalyze to analyze my data in R with the MPRAnalyze package. I have noticed that, despite my providing a labels file with --labels, the --mpranalyze files seem to use defaults, as if no labels file had been provided. Any suggestions?

Here is my nextflow call, slightly modified by the IT team here to get things to work on our HPC:

nextflow-run-count \
  --experiment-file "/projects/b1073/MPRA/epiMPRA_bcCOUNT_21dec2023nextSeq_20Mreads/experiment_combined_biorep1.csv" \
  --dir "/projects/b1073/MPRA/epiMPRA_bcCOUNT_21dec2023nextSeq_20Mreads" \
  --association /projects/b1073/MPRA/MPRAflow/Assoc_Basic/output/test_bc2cre_assoc_epiMPRA_deltaCMV_26oct2023_v2/test_bc2cre_assoc_epiMPRA_deltaCMV_26oct2023_v2/test_bc2cre_assoc_epiMPRA_deltaCMV_26oct2023_v2_filtered_coords_to_barcodes.pickle \
  --design /projects/b1073/MPRA/epiMPRA_bc2cre_Assoc_26oct2023miniseq/epiMPRA_deltaCMV_design.fa \
  --thresh 10 \
  --umi-length 16 \
  --bc-length 15 \
  --outdir /projects/b1073/MPRA/epiMPRA_bcCOUNT_21dec2023nextSeq_20Mreads/Count_MPRAnalyze_Data_TESTcombined_v2 \
  --labels /projects/b1073/MPRA/test_labels_epiMPRA_v1.tsv \
  --mpranalyze

Here are the first few lines of test_labels_epiMPRA_v1.tsv, used as input for --labels:

1685alt	1685
1044ref	1044
1756alt	1756
949alt	949
353alt	353
...
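Before suspecting the pipeline, it can help to rule out a malformed labels file. The following is a minimal sketch, assuming the file is tab-separated with exactly two fields per line (a name, then a label), as the excerpt above suggests. The function name `read_labels` is hypothetical and not part of MPRAflow, and the in-memory example stands in for the real file.

```python
import csv
import io


def read_labels(handle):
    """Parse a two-column, tab-separated labels file into {name: label}.

    Assumption: each non-empty line holds exactly two tab-separated fields.
    Raises ValueError on any line that does not match that shape.
    """
    labels = {}
    for line_no, row in enumerate(csv.reader(handle, delimiter="\t"), start=1):
        if not row:  # skip blank lines
            continue
        if len(row) != 2:
            raise ValueError(
                f"line {line_no}: expected 2 tab-separated fields, got {len(row)}"
            )
        name, label = row
        labels[name] = label
    return labels


# In-memory stand-in for the first lines of the labels file shown above.
example = io.StringIO("1685alt\t1685\n1044ref\t1044\n1756alt\t1756\n")
labels = read_labels(example)
print(labels)
```

To check the real file, open it with `open(path, newline="")` and pass the handle to `read_labels`. A line separated by spaces instead of tabs would raise here, which is one way such a file could be silently ignored downstream.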

Here are the first few lines of the dna_annot file generated by the nextflow barcode count script. It appears to be identical to the file produced when the above command is run with the --labels argument omitted:

sample               type  condition    replicate  barcode
DNA_CondidtionC_1_1  DNA   CondidtionC  1          1
DNA_CondidtionC_1_2  DNA   CondidtionC  1          2
DNA_CondidtionC_1_3  DNA   CondidtionC  1          3
DNA_CondidtionC_1_4  DNA   CondidtionC  1          4
...
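To confirm that the two runs really produce identical output, rather than output that only looks identical in the first few lines, the files can be compared by hash. This is a minimal sketch, assuming each run was written to its own output directory. The function names and directory arguments are hypothetical, and nothing here depends on MPRAflow itself.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def compare_runs(dir_with_labels, dir_without_labels):
    """Compare files that exist under the same relative path in both runs.

    Returns {relative_path: True if byte-identical, else False}.
    Files present in only one directory are not reported.
    """
    a, b = Path(dir_with_labels), Path(dir_without_labels)
    result = {}
    for file_a in sorted(p for p in a.rglob("*") if p.is_file()):
        rel = file_a.relative_to(a)
        file_b = b / rel
        if file_b.is_file():
            result[rel.as_posix()] = file_digest(file_a) == file_digest(file_b)
    return result


# Example usage with hypothetical output directories:
#   for name, same in compare_runs("out_with_labels", "out_no_labels").items():
#       print("identical" if same else "DIFFERS", name)
```

If every file is reported as identical, then --labels had no effect on that run's output.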

Please let me know if there is any additional information I can provide. Thank you!!

calhoujd commented 10 months ago

Unless the final_labeled_counts.txt file is the actual file to be used as MPRAnalyze input? Sorry, the documentation isn't super clear, and I was trying to match the files to the MPRAnalyze vignette as best I could. Happy to help write updated documentation on how to run MPRAnalyze from MPRAflow if we can get this solved.

visze commented 10 months ago

Hi.

Thanks for your comment, and great to hear that MPRAflow was helpful to you. Unfortunately, I am the only developer left on the pipeline, I am currently on parental leave, and my focus is now on developing MPRAsnakeflow for MPRA processing. That pipeline will support more MPRA designs, focus on variants (which are handled suboptimally in MPRAflow), and will be the standard pipeline in the IGVF consortium.

But of course I am happy to merge any reasonable PR into the repository, e.g. if you want to add some fixes for the mpranalyze output.

We also did extensive comparisons between different analysis software packages and saw that mpralm seems to work nicely. Its input is also a bit easier to generate. I don't want to hide that information.

Best, Max

calhoujd commented 9 months ago

Hi Max,

Thanks for the update! I will check out MPRAsnakeflow and mpralm. I'll close this issue.

Take care,

Jeff