shendurelab / MPRAflow

A portable, flexible, parallelized tool for complete processing of massively parallel reporter assay data
Apache License 2.0

Basic Association: filter_barcodes error - issue with pyarrow #36

Closed townsk closed 3 years ago

townsk commented 3 years ago

Any suggestions for "ModuleNotFoundError: No module named 'pyarrow'"?

Prior to running the association script I installed pyarrow: conda install pyarrow -c conda-forge

[Screenshot: Screen Shot 2021-03-04 at 3 56 42 PM]

Thanks!

townsk commented 3 years ago

Okay pyarrow issue resolved - now having issues with pandas. Specifically loading the module pandas._libs.interval. See error code below:

[Screenshot: Screen Shot 2021-03-04 at 4 37 10 PM]

visze commented 3 years ago

Hey. The pyarrow issue is a bit strange, because we are using nextflow with conda!

Here we define the conda environment for the filter_barcode step: https://github.com/shendurelab/MPRAflow/blob/fd5ff26b04686196ac37ed849afbcfc01b303b3f/association.nf#L494

And here you can find pyarrow in the environment: https://github.com/shendurelab/MPRAflow/blob/fd5ff26b04686196ac37ed849afbcfc01b303b3f/conf/mpraflow_py36.yml#L108

Maybe your nextflow run does not use conda, just your base environment? This becomes very important for the count step, because there we have scripts in both Python 2 and Python 3.
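One way to check this is to print which Python interpreter is actually running and whether it can see pyarrow and pandas. This is only a minimal sketch and not part of MPRAflow: the helper name check_modules is hypothetical, and it uses just the standard library so it should also run in a Python 3.6 environment.

```python
import importlib.util
import sys


def check_modules(names):
    """Map each top-level module name to True if it can be located, else False."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # The interpreter path shows whether the conda environment or base is in use.
    print("interpreter:", sys.executable)
    for name, found in check_modules(["pyarrow", "pandas"]).items():
        print(name, "found" if found else "MISSING")
```

If the interpreter path points at the base installation rather than the mpraflow environment, the missing module is probably an environment problem and not a problem with the script.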

When you run the script yourself, you also have to use the environment:

conda env create -n mpraflow_p36 -f conf/mpraflow_py36.yml
conda activate mpraflow_p36
python src/nf_filter_barcodes.py

visze commented 3 years ago

@townsk you added some comments yesterday (I saw them in my mails), but they are not listed anymore. Just let me know if you need further help.

townsk commented 3 years ago

@visze Thanks for checking in. I worked through the issues I posted so I removed them, but I am still having trouble. When I enter the script myself I don't get any errors but it runs for hours without completion -- do you have an estimated runtime for the filter barcode process?

[Screenshot: Screen Shot 2021-03-09 at 10 27 48 AM]

visze commented 3 years ago

The runtime of the nf_filter_barcodes.py script depends on the size of the input. But in theory it should be one of the quicker scripts. Definitely under 1 hour.

But maybe you have some issues with plotting the violin plots. Can you comment out these lines: https://github.com/shendurelab/MPRAflow/blob/fd5ff26b04686196ac37ed849afbcfc01b303b3f/src/nf_filter_barcodes.py#L133 and https://github.com/shendurelab/MPRAflow/blob/fd5ff26b04686196ac37ed849afbcfc01b303b3f/src/nf_filter_barcodes.py#L142

townsk commented 3 years ago

Plotting the violin plots does seem to be the issue - once lines 133 and 142 are commented out, it runs and creates the filtered barcode pickle file.

visze commented 3 years ago

Thanks. Good to know that it works now. But it is strange that it fails when creating the plots. I can imagine 3 possible issues:

  1. Large data, too many data points for plotting
  2. Plotting/graphical issue on your system
  3. seaborn library issue (we are using version 0.9.0, maybe we need to update)
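The first two possibilities can be probed without the original input. The sketch below is an illustration under assumptions, not the code in nf_filter_barcodes.py: downsample and plot_violin are hypothetical helpers, the 10000-point cap and the output file name are arbitrary choices, and it assumes matplotlib and seaborn are installed. It caps the number of points before plotting and selects the non-interactive Agg backend so that no display is needed.

```python
import random


def downsample(values, max_points=10000, seed=0):
    """Return at most max_points values, sampled without replacement."""
    values = list(values)
    if len(values) <= max_points:
        return values
    return random.Random(seed).sample(values, max_points)


def plot_violin(values, path="violin_test.png"):
    """Draw a single violin plot to a file without needing a display."""
    import matplotlib
    matplotlib.use("Agg")  # non-interactive backend, rules out GUI problems
    import matplotlib.pyplot as plt
    import seaborn as sns

    fig, ax = plt.subplots()
    # Only the downsampled values are plotted, which bounds the plotting cost.
    sns.violinplot(y=downsample(values), ax=ax)
    fig.savefig(path)
    plt.close(fig)
```

If plot_violin finishes quickly on a large list of numbers, data size or the display backend is the more likely cause. If it still hangs, the seaborn version would be the next thing to look at.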

But I can only debug this when you give me your input data.

visze commented 3 years ago

I will close this because I cannot debug it without the data.

Please reopen if necessary.