michaelkyu / PlasX

PlasX, a machine learning classifier for identifying plasmid sequences based on genetic architecture
GNU General Public License v3.0

Running plasx search_de_novo_families results in error: No .m8 file #10

Open JiabaoYuuuuu opened 6 months ago

JiabaoYuuuuu commented 6 months ago

For the previous problem, I modified the mmseq.py file and changed it to:

    if mmseqs_profiles_url is None:
        mmseqs_profiles_url = 'file:///xxxx/PlasX_mmseqs_profiles.tar.gz'

    if coefficients_url is None:
        coefficients_url = 'file:///xxx/PlasX_coefficients_and_gene_enrichments.txt.gz'

After that, I ran:

    plasx setup \
        --de-novo-families 'file:///xxx/PlasX_mmseqs_profiles.tar.gz' \
        --coefficients 'file:///xxx/PlasX_coefficients_and_gene_enrichments.txt.gz'

Then I ran the next step:

    plasx search_de_novo_families \
        -g $PREFIX-gene-calls.txt \
        -o $PREFIX-de-novo-families.txt \
        --threads $THREADS \
        --splits 32 \
        --overwrite

The error message was:

    FileNotFoundError: The file /tmp/tmpienmre36/mmseqs/clu90.m8 was supposed to be created, but it doesn't exist. This might be because the search using mmseqs2 ran out of system RAM. Consider setting the -S flag to reduce the maximum RAM usage. E.g., if you only have ~8Gb RAM, we recommend setting -S to 32 or higher.

My question is: do I need to install additional software, such as diamond, to generate the .m8 files? My server has plenty of memory, so I don't think the cause is mmseqs2 running out of RAM.

meren commented 6 months ago

Hey @JiabaoYuuuuu,

I am glad you managed to solve the download issue. We should change the setup function so it dynamically determines whether the user provided a URL online or a file path on their system for these files.
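
One way the setup step could tell a URL from a local path is sketched below. This is only an illustration of the idea: `to_url` is a hypothetical helper, not an existing PlasX function, and the sketch assumes the downstream download code accepts `file://` URLs, as the workaround above suggests.

```python
from pathlib import Path
from urllib.parse import urlparse


def to_url(source: str) -> str:
    """Return `source` unchanged if it already is a URL, else a file:// URL.

    Hypothetical helper for illustration; not part of PlasX.
    """
    scheme = urlparse(source).scheme
    # A one-letter "scheme" is most likely a Windows drive letter (C:\...),
    # so only treat longer schemes (https, ftp, file, ...) as real URLs.
    if len(scheme) > 1:
        return source
    # Otherwise treat the argument as a path on the local file system.
    return Path(source).expanduser().resolve().as_uri()


if __name__ == "__main__":
    print(to_url("https://example.org/PlasX_mmseqs_profiles.tar.gz"))
    print(to_url("PlasX_mmseqs_profiles.tar.gz"))
```

With something like this in place, a user could pass either a web address or a plain file path to `--de-novo-families` and `--coefficients` without editing the source.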

The memory issue is a weird one. You shouldn't need to install any additional software -- the error is due to missing mmseqs2 files that were somehow not generated :( Are you submitting your jobs to the server via Slurm? Or are you using it interactively?
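
A quick sanity check that can be run inside the same job environment is sketched below. It rests on assumptions rather than a confirmed diagnosis: that the MMseqs2 executable is called `mmseqs`, and that a missing binary or a full temporary directory are among the things worth ruling out. `check_mmseqs_environment` is a made-up name for this example.

```python
import shutil
import tempfile


def check_mmseqs_environment(binary: str = "mmseqs") -> dict:
    """Report whether `binary` is on PATH and how much temp space is free.

    Illustrative helper only; it does not run any search.
    """
    tmp_dir = tempfile.gettempdir()
    usage = shutil.disk_usage(tmp_dir)
    return {
        "mmseqs_path": shutil.which(binary),  # None if not found on PATH
        "tmp_dir": tmp_dir,
        "tmp_free_gb": round(usage.free / 1e9, 1),
    }


if __name__ == "__main__":
    for key, value in check_mmseqs_environment().items():
        print(f"{key}: {value}")
```

Running it both interactively and from a submitted job would show whether the two environments differ.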

JiabaoYuuuuu commented 6 months ago

Hi meren, I submitted the task to the server. I couldn't solve this issue through the manual installation method, so I manually downloaded the files PlasX_mmseqs_profiles.tar.gz and PlasX_coefficients_and_gene_enrichments.txt.gz from Zenodo, uploaded them to another website, and downloaded them again from there (I have already deleted these files from the other website). After that, I ran:

    plasx search_de_novo_families \
        -g $PREFIX-gene-calls.txt \
        -o $PREFIX-de-novo-families.txt \
        --threads $THREADS \
        --splits 32 \
        --overwrite

This is the result I generated using the test files you provided:

    gene_callers_id  contig                start  stop  direction  rev_compd  length  e_value    accession
    1                AST0002_000000019451  1152   1908  r          True       756     0.0        mmseqs_40_33078316
    1                AST0002_000000019451  1152   1908  r          True       756     0.0        mmseqs_30_43406241
    1                AST0002_000000019451  1152   1908  r          True       756     0.0        mmseqs_25_49900063
    1                AST0002_000000019451  1152   1908  r          True       756     0.0        mmseqs_20_50193611
    2                AST0002_000000009188  754    1807  f          False      1053    0.0        mmseqs_70_18699477
    2                AST0002_000000009188  754    1807  f          False      1053    0.0        mmseqs_30_48665148
    2                AST0002_000000009188  754    1807  f          False      1053    0.0        mmseqs_30_44498373
    2                AST0002_000000009188  754    1807  f          False      1053    0.0        mmseqs_25_41046439
    2                AST0002_000000009188  754    1807  f          False      1053    4.37e-43   mmseqs_25_35867596
    2                AST0002_000000009188  754    1807  f          False      1053    0.0        mmseqs_20_42904105
    2                AST0002_000000009188  754    1807  f          False      1053    1.358e-37  mmseqs_20_38845624

It looks somewhat different from the provided template.

Then, when I ran the next step, plasx predict, a new issue occurred:

    Loading model from /mnt/sdb/weizhonglab/yujiabao/lib/anaconda3/envs/plasx/lib/python3.10/site-packages/plasx/data/PlasX_coefficients_and_gene_enrichments.txt (11:11:11)
    Traceback (most recent call last):
      File "/mnt/sdb/weizhonglab/yujiabao/lib/anaconda3/envs/plasx/lib/python3.10/site-packages/plasx/pd_utils.py", line 1036, in read_table
        C = utils.unpickle(A)
      File "/mnt/sdb/weizhonglab/yujiabao/lib/anaconda3/envs/plasx/lib/python3.10/site-packages/plasx/compress_utils.py", line 288, in unpickle
        ret = blosc_decompress(path_or_buf, stream=stream, obj_type='pickle', verbose=verbose)
      File "/mnt/sdb/weizhonglab/yujiabao/lib/anaconda3/envs/plasx/lib/python3.10/site-packages/plasx/compress_utils.py", line 268, in blosc_decompress
        return pkl.loads(b"".join(arr))
    EOFError: Ran out of input

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "/mnt/sdb/weizhonglab/yujiabao/lib/anaconda3/envs/plasx/bin/plasx", line 8, in <module>
        sys.exit(run())
      File "/mnt/sdb/weizhonglab/yujiabao/lib/anaconda3/envs/plasx/lib/python3.10/site-packages/plasx/plasx_script.py", line 140, in run
        args.func(args)
      File "/mnt/sdb/weizhonglab/yujiabao/lib/anaconda3/envs/plasx/lib/python3.10/site-packages/plasx/plasx_script.py", line 21, in predict
        model = PlasX_model.from_table(args.model)
      File "/mnt/sdb/weizhonglab/yujiabao/lib/anaconda3/envs/plasx/lib/python3.10/site-packages/plasx/model.py", line 52, in from_table
        df = utils.read_table(path).set_index('accession')['PlasX_coefficient']
      File "/mnt/sdb/weizhonglab/yujiabao/lib/anaconda3/envs/plasx/lib/python3.10/site-packages/plasx/pd_utils.py", line 1042, in read_table
        C = pd.read_table(A, **read_table_kws)
      File "/mnt/sdb/weizhonglab/yujiabao/lib/anaconda3/envs/plasx/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1282, in read_table
        return _read(filepath_or_buffer, kwds)
      File "/mnt/sdb/weizhonglab/yujiabao/lib/anaconda3/envs/plasx/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 611, in _read
        parser = TextFileReader(filepath_or_buffer, **kwds)
      File "/mnt/sdb/weizhonglab/yujiabao/lib/anaconda3/envs/plasx/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1448, in __init__
        self._engine = self._make_engine(f, self.engine)
      File "/mnt/sdb/weizhonglab/yujiabao/lib/anaconda3/envs/plasx/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1723, in _make_engine
        return mapping[engine](f, **self.options)
      File "/mnt/sdb/weizhonglab/yujiabao/lib/anaconda3/envs/plasx/lib/python3.10/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 93, in __init__
        self._reader = parsers.TextReader(src, **kwds)
      File "parsers.pyx", line 586, in pandas._libs.parsers.TextReader.__cinit__
    pandas.errors.EmptyDataError: No columns to parse from file

So I reran the plasx predict step using the test-contigs-de-novo-families.txt file from your test files and got the same error message. Does this mean I still haven't installed it successfully? meren, could you provide a method for manual installation? Thank you very much for your reply.
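
Both exceptions above (`EOFError: Ran out of input` and `EmptyDataError: No columns to parse from file`) are what Python and pandas typically raise when handed an empty input, so one thing worth checking is whether the installed coefficients file is empty or truncated. The sketch below is a generic file check under that assumption; `inspect_table` is a hypothetical helper, not a PlasX function, and the path in the usage example is a placeholder.

```python
import gzip
import os


def inspect_table(path: str, n_lines: int = 3) -> dict:
    """Report the size of `path` and its first few lines (gzip-aware)."""
    size = os.path.getsize(path)
    # Gzip files start with the two magic bytes 0x1f 0x8b.
    with open(path, "rb") as handle:
        is_gzip = handle.read(2) == b"\x1f\x8b"
    opener = gzip.open if is_gzip else open
    head = []
    # errors="replace" keeps this from crashing on binary content.
    with opener(path, "rt", errors="replace") as handle:
        for _, line in zip(range(n_lines), handle):
            head.append(line.rstrip("\n"))
    return {"size_bytes": size, "is_gzip": is_gzip, "head": head}


if __name__ == "__main__":
    # Placeholder path: point this at the file named in "Loading model from ...".
    print(inspect_table("PlasX_coefficients_and_gene_enrichments.txt"))
```

A `size_bytes` of 0 or an empty `head` would suggest that the file was not copied or decompressed correctly during `plasx setup`. Judging by the traceback, a usable table should have at least the columns `accession` and `PlasX_coefficient`.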