shendurelab / MPRAflow

A portable, flexible, parallelized tool for complete processing of massively parallel reporter assay data
Apache License 2.0

Error in count.nf process create_BAM_noUMI #47

Closed townsk closed 2 years ago

townsk commented 3 years ago

Hi there,

I've been hitting an error in the create_BAM_noUMI process with my files, and it points to a potential issue in the FastQ2doubleIndexBAM.py and MergeTrimReadsBAM.py scripts. I fixed the error in string.maketrans('.','N') and the syntax error in print "Read cosensus".

Now if I run the code from count.nf lines 293-339 separately, defining each file and path, I get similar errors (see last screenshot).

Any thoughts? Many thanks!

[Screenshots attached: Screen Shot 2021-07-15 at 4 34 57 PM, Screen Shot 2021-07-15 at 4 35 16 PM, Screen Shot 2021-07-15 at 5 20 10 PM]

visze commented 3 years ago

Hey,

Can you tell me which version you are using? The current master branch (changed yesterday), 2.1, 2.2, 2.3, ...?

townsk commented 3 years ago

I am using version 2.2

visze commented 3 years ago

Can you try the newest 2.3.1?

townsk commented 3 years ago

Hi Max,

I reran using version 2.3.1 and got the same error.

visze commented 3 years ago

Hi,

OK, maybe I found the issue. Are you using uncompressed fastq files? We only support gzipped fastq files, so you should gzip them first, e.g.:

gzip -c FileX.fastq > FileX.fastq.gz
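
To confirm whether an input file is really gzip-compressed (rather than only carrying a .gz suffix), one option is to look at its first two bytes, which are 0x1f 0x8b for gzip data. The following is a minimal sketch, not part of MPRAflow; the helper name is_gzipped and the demo file names are assumptions made for illustration.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def is_gzipped(path):
    """Return True if the file starts with the two-byte gzip magic number."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


# Demo on two throwaway files (hypothetical names): one plain, one gzipped.
tmpdir = tempfile.mkdtemp()
plain_path = os.path.join(tmpdir, "FileX.fastq")
gz_path = plain_path + ".gz"
record = b"@read1\nACGT\n+\nIIII\n"

with open(plain_path, "wb") as handle:
    handle.write(record)
with gzip.open(gz_path, "wb") as handle:
    handle.write(record)

print(is_gzipped(plain_path), is_gzipped(gz_path))
```

The check reads only the file header, so it is cheap on large fastq files; it does not verify that the rest of the stream is intact.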

townsk commented 3 years ago

The fastq files are compressed.

Some of the examples for the experiment.csv file had the .gz included in the file names and another version didn't. So I tried out both ways and got the same error.

Although, I haven't done that for version 2.3.1 so I will try that now.

-K

townsk commented 3 years ago

Hi Max,

I did what I mentioned above. I'm still getting errors, and they often point to issues in the FastQ2doubleIndexBAM.py, library.py, and MergeTrimReadsBAM.py files, mostly syntax errors. Some of them I can fix and others I'm not sure about, specifically the last syntax error reported.

-K

[Screenshot attached: Screen Shot 2021-07-27 at 12 02 09 PM]

visze commented 3 years ago

Hi,

Sorry, I am a bit lost because I am on parental leave and have little time to support MPRAflow during this period. That's why my last answers were short and not helpful. This one may be the same.

But the "missing parentheses in print" error sounds very much like a Python 2.x vs 3.x issue. The scripts are written in Python 2.7, so you should use that environment. Normally MPRAflow handles that with the conda environment. This should be the correct conda environment: https://github.com/shendurelab/MPRAflow/blob/3df34853904e0245507a361c700791c3e49a7e0e/conf/mpraflow_py27.yml

Best, Max
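
The two errors mentioned in this thread, string.maketrans and the print statement, are both typical of Python 2.7 code executed by a Python 3 interpreter. The sketch below is not taken from the MPRAflow scripts; it only illustrates the difference and reports which interpreter is running. The function names are hypothetical, and the advice above (run the scripts in the Python 2.7 conda environment) still applies rather than patching them.

```python
from __future__ import print_function  # enables print() on 2.7; no effect on 3.x

import sys


def make_dot_to_n_table():
    """Build a translation table mapping '.' to 'N' on Python 2.7 or 3.x."""
    if sys.version_info[0] >= 3:
        # Python 3 moved maketrans from the string module to the str type.
        return str.maketrans(".", "N")
    import string
    return string.maketrans(".", "N")


def interpreter_summary():
    """Return the major.minor version of the running interpreter."""
    return "Python %d.%d" % sys.version_info[:2]


print(interpreter_summary())
print("AC.GT..A".translate(make_dot_to_n_table()))
```

If the first line printed is not "Python 2.7" when the scripts are run by hand, the interpreter on the PATH is probably not the one from the conda environment.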

townsk commented 3 years ago

Hi Max,

No worries, I appreciate you still providing some suggestions while you're on leave. I'll continue debugging and will update the conda environment.

Cheers, K

townsk commented 3 years ago

Hi Max,

I was never able to resolve the issue. I do think you're right about the version error, but I was wondering (if you're back from leave) whether you'd have time to troubleshoot with me further.

Thanks!

visze commented 3 years ago

I am back :-)

Now I have time to track down the issue on your side. First, I will rerun the count workflow example (https://mpraflow.readthedocs.io/en/latest/count_example1.html) with a new conda environment and the current MPRAflow version, 2.3.1. For now I am using nextflow 20.10.0, the newest release that does not require switching to the newest major version, 21.10.9.

If I cannot reproduce the issue with the example, I think it is something specific to your data, and I will need your help (e.g. an excerpt of your data so I can reproduce the error on my side).

Best, Max

visze commented 3 years ago

OK, just to be sure: are you running the script separately?

Are you sure that you are using Python 2.7 to run the script? nextflow handles this internally (using conda), but if you run the script outside of nextflow you should use the environment that I linked in one of my comments above!

visze commented 3 years ago

Hi,

The count test workflow runs without any issues. Can you provide me with some of your data so I can run the workflow on my side?

Command:

nextflow run count.nf -w Count_Basic/work --experiment-file Count_Basic/experiment.csv --dir Count_Basic/data --outdir Count_Basic/out --design Count_Basic/design.fa --association Count_Basic/HEPG2-association_filtered_coords_to_barcodes.pickle

Logs:

N E X T F L O W  ~  version 20.10.0
Launching `count.nf` [drunk_miescher] - revision: d86f5331f8
=======================================================
                                          ,--./,-.
          ___     __   __   __   ___     /,-._.--~'
    |\ | |__  __ /  ` /  \ |__) |__         }  {
    | \| |       \__, \__/ |  \ |___     \`-._,-`-,
                                          `._,._,'
MPRAflow v2.3.1"
=======================================================
Pipeline Name  : shendurelab/MPRAflow
Pipeline Version: 2.3.1
Run Name       : drunk_miescher
Output dir     : Count_Basic/out
Working dir    : ***
Current home   : ***
Current user   : ***
Current path   : ***
Script dir     : ***
Config Profile : standard
Experiment File: ***
reads          : DataflowQueue(queue=[])
UMIs           : Reads with UMI
BC length      : 15
BC threshold   : 10
mprAnalyze     : false
=========================================
start analysis
executor >  slurm (32)
[96/74e20a] process > create_BAM (make idx)    [100%] 6 of 6 ✔
[c8/1a415e] process > raw_counts (6)           [100%] 6 of 6 ✔
[75/5aa8af] process > filter_counts (6)        [100%] 6 of 6 ✔
[5b/343258] process > final_counts (6)         [100%] 6 of 6 ✔
[2a/eabaed] process > dna_rna_merge_counts (2) [100%] 3 of 3 ✔
[07/0c4fe5] process > dna_rna_merge (3)        [100%] 3 of 3 ✔
[8f/ae0d36] process > calc_correlations (1)    [100%] 1 of 1 ✔
[e9/925caf] process > make_master_tables (1)   [100%] 1 of 1 ✔
Completed at: 16-Nov-2021 21:21:11
Duration    : 7h 46m 45s
CPU hours   : 30.3
Succeeded   : 32

townsk commented 3 years ago

So I found that the version error has something to do with my account on our server, as a colleague can run the script without the errors. However, it still doesn't run to completion.

What aspects of my data would be the most helpful for troubleshooting?

visze commented 3 years ago

What do you mean by "doesn't run completely"? Is it the MPRAflow workflow that does not finish, or does the script not run on all of the data?

About the data for debugging: it will help if you can send me your fastq files (found in the --dir directory) and the experiment CSV file. You can limit your fastq files to the first 3K reads if you like.
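
One way to produce such an excerpt is to copy the first 3,000 records from each gzipped fastq file. The sketch below assumes standard four-line fastq records (no wrapped sequence lines); the function name write_first_reads and the demo file names are hypothetical and not part of MPRAflow.

```python
import gzip
import os
import tempfile


def write_first_reads(src_path, dst_path, n_reads=3000):
    """Copy the first n_reads records from a gzipped FASTQ; return the count copied.

    Assumes each record spans exactly 4 lines.
    """
    max_lines = 4 * n_reads
    copied = 0
    with gzip.open(src_path, "rb") as src, gzip.open(dst_path, "wb") as dst:
        for line in src:
            if copied >= max_lines:
                break
            dst.write(line)
            copied += 1
    return copied // 4


# Demo on a tiny synthetic file: five reads in, three reads out.
tmpdir = tempfile.mkdtemp()
full_path = os.path.join(tmpdir, "full.fastq.gz")
subset_path = os.path.join(tmpdir, "subset.fastq.gz")
with gzip.open(full_path, "wb") as handle:
    for i in range(5):
        handle.write(("@read%d\nACGT\n+\nIIII\n" % i).encode("ascii"))

n_written = write_first_reads(full_path, subset_path, n_reads=3)
print(n_written)
```

Applying the same n_reads to every file of a sample should keep the reads matched across files, provided the files list their reads in the same order.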

visze commented 2 years ago

Closing because of inactivity.

Please reopen if needed.