khyox / recentrifuge

Recentrifuge: robust comparative analysis and contamination removal for metagenomics
http://www.recentrifuge.org

pandas and Excel output #2

Closed devonorourke closed 6 years ago

devonorourke commented 6 years ago

Hi Jose, I'm running recentrifuge in a virtual environment created as follows:

conda create -n py36 python=3.6 openpyxl biopython pandas

However, the recentrifuge program throws an error at the final step where it would generate an Excel file because it does not recognize Pandas as being installed.

The error:

Building the taxonomy multiple tree... OK!
Generating final plot (/mnt/lustre/macmaneslab/devon/pore604/centrifuge/pore604_results_unclassified.rcf.html)... OK!
WARNING! Pandas not installed: Excel cannot be created.

This is odd to me because pandas does appear to be installed in that virtual environment:

(py36) [devon@premise]$ pip freeze
biopython==1.70
certifi==2018.4.16
et-xmlfile==1.0.1
jdcal==1.4
mmtf-python==1.1.0
msgpack==0.5.6
numpy==1.12.1
olefile==0.45.1
ont-fast5-api==0.4.1
openpyxl==2.4.0b1
pandas==0.22.0
Pillow==4.2.1
progressbar33==2.4
python-dateutil==2.7.2
pytz==2018.4
reportlab==3.4.0
six==1.11.0

I thought maybe pandas was too new, so I downgraded to an older version (0.20.3), but the issue is the same. Do you have any suggestions on how to troubleshoot further?

Thanks!
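
A warning like the one above is typically produced by an optional-import guard: the script tries to import the module and falls back when the import fails. The sketch below is only an illustration of that pattern, not Recentrifuge's actual code; `optional_import` is a hypothetical helper. It shows why the message can appear even when `pip freeze` lists pandas: what matters is whether the interpreter running the script can import it.

```python
import importlib


def optional_import(name):
    """Return the imported module, or None when it cannot be imported.

    Hypothetical helper for illustration; not part of Recentrifuge.
    """
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


# The import is attempted by the interpreter that runs this script,
# which may differ from the one `pip freeze` reported on.
pd = optional_import("pandas")
if pd is None:
    print("WARNING! Pandas not installed: Excel cannot be created.")
```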

Versions

khyox commented 6 years ago

Hi Devon, Thank you very much for reporting this issue! I will try to replicate the problem and come back with a workaround. Jose

devonorourke commented 6 years ago

Sounds great, Jose. I was wondering how I could display my .html file online like you have and discovered rawgit. That has solved so many problems for me, so recentrifuge is proving to be extraordinarily useful! Cheers, Devon

khyox commented 6 years ago

Hi Devon, I am not an expert in conda but, surprisingly, the problem seems unrelated to Recentrifuge. I just installed miniconda from scratch on OS X 10.11.6, and I cannot import pandas (or any of the other installed modules) from the conda python shell:

Python 3.6.5 |Anaconda, Inc.| (default, Mar 29 2018, 13:14:23)
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'pandas'
>>> import biopython
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'biopython'
>>> import openpyxl
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'openpyxl'
>>> import numpy
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'numpy'

I installed the new conda environment exactly as you did:

conda create -n py36 python=3.6 openpyxl biopython pandas

Could you please check that you have the same problem importing pandas (and other modules) from the conda python shell?
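
A small script can make this check repeatable. The sketch below reports which interpreter is running and where each module would be loaded from; `module_report` and `describe_environment` are hypothetical names chosen for this example. Note that Biopython is imported as `Bio`, so `import biopython` fails in any environment and that particular traceback says nothing about the installation.

```python
import importlib.util
import sys


def module_report(names):
    """Map each top-level module name to its file location, or None if missing."""
    report = {}
    for name in names:
        spec = importlib.util.find_spec(name)
        report[name] = None if spec is None else spec.origin
    return report


def describe_environment(names=("pandas", "numpy", "openpyxl", "Bio")):
    """Print the running interpreter and the location of each module."""
    print("interpreter:", sys.executable)
    print("prefix:     ", sys.prefix)
    for name, origin in module_report(names).items():
        print(f"{name:10s}", origin if origin else "NOT FOUND")


describe_environment()
```

If the interpreter path does not point inside the py36 environment, the environment is probably not the one actually in use.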

devonorourke commented 6 years ago

I think I've resolved the issue: I need to import numpy with pandas. The pandas import error suggests it's a dependency. I'll try running the recentrifuge script next.

khyox commented 6 years ago

It seems to be an issue with the conda environment activation. I am using tcsh as the shell on that machine, and the py36 environment does not seem to be activated, even after I used the activate command. When forcing the bash shell, things work straight away:

bash-3.2$ source activate py36
(py36) bash-3.2$ python
Python 3.6.5 |Anaconda, Inc.| (default, Apr 26 2018, 08:42:37)
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>>

Recentrifuge is also working fine:

=-= recentrifuge.py =-= v0.18.6 =-= May 2018 =-=
(...)
Building the taxonomy multiple tree... OK!
Generating final plot (test.rcf.html)... OK!
Generating Excel full summary (test.rcf.xlsx)... OK!

devonorourke commented 6 years ago

Hi Jose, I was also able to run the program without errors this time (the Excel file was created), but the difference is that this time I was not on my computer cluster submitting the job through SLURM. I wonder if the issue is arising there? While I have a way to get it to work at the moment, I'll try to figure out what the root of the issue is soon. Thanks, Devon
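
One way to compare an interactive session with a batch job is to run the same check in both and diff the output. The sketch below assumes the standard variables that conda sets on activation (`CONDA_PREFIX`, `CONDA_DEFAULT_ENV`) and that SLURM sets inside a job (`SLURM_JOB_ID`); the function names are hypothetical.

```python
import os
import sys


def activation_status(environ, interpreter_prefix):
    """Say whether the running python belongs to the active conda environment."""
    conda_prefix = environ.get("CONDA_PREFIX")
    if not conda_prefix:
        return "no conda environment is active in this shell"
    if os.path.realpath(conda_prefix) != os.path.realpath(interpreter_prefix):
        return "conda environment active, but python comes from elsewhere"
    return "python belongs to the active conda environment"


def describe_job(environ=os.environ):
    """Print the shell, conda and SLURM context of the current process."""
    print("shell:       ", environ.get("SHELL", "unknown"))
    print("conda env:   ", environ.get("CONDA_DEFAULT_ENV", "none"))
    print("SLURM job id:", environ.get("SLURM_JOB_ID", "not a SLURM job"))
    print("status:      ", activation_status(environ, sys.prefix))


describe_job()
```

If the batch run reports no active environment while the interactive run does, the job script is likely not activating py36 before calling Recentrifuge.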

devonorourke commented 6 years ago

One final thought: why not add an option for this output file to be formatted via pandas as either an Excel sheet, or a tsv or csv file? I'm sure many people could benefit from those delimited formats also.
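
As a rough sketch of what such an option could look like, the function below picks the writer from the file extension, using the pandas `DataFrame.to_excel` and `DataFrame.to_csv` methods. `write_summary` and `SEPARATORS` are hypothetical names for this example and do not describe how Recentrifuge is implemented.

```python
from pathlib import Path

# Hypothetical mapping from file extension to field separator.
SEPARATORS = {".tsv": "\t", ".csv": ","}


def write_summary(frame, path):
    """Write a pandas DataFrame to .xlsx, .tsv or .csv, chosen by extension."""
    suffix = Path(path).suffix.lower()
    if suffix == ".xlsx":
        # Requires an Excel engine such as openpyxl to be installed.
        frame.to_excel(path)
    elif suffix in SEPARATORS:
        frame.to_csv(path, sep=SEPARATORS[suffix])
    else:
        raise ValueError(f"unsupported output format: {suffix!r}")
    return suffix
```

With a DataFrame `df`, calling `write_summary(df, "test.rcf.tsv")` would produce a tab-separated file, and only the `.xlsx` branch would need openpyxl.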

khyox commented 6 years ago

Hi Devon,

Thank you very much for your suggestion! Yes, I also think that offering tsv and/or csv as alternative output formats would be a nice feature. I will add it to the top of the Recentrifuge "to do" list.

By the way, answering your previous comment, I am glad you find Recentrifuge useful!

I am closing the issue. Please feel free to reopen it if a similar problem arises. Cheers, Jose