huangnengCSU / compleasm

A genome completeness evaluation tool based on miniprot
Apache License 2.0

Does compleasm try to download something when run specifying lineage? #6

Open olekto opened 1 year ago

olekto commented 1 year ago

Hi, I am trying to get compleasm running, but run into an issue. Specifically, I get this error:

Traceback (most recent call last):
  File "/cluster/projects/nn8013k/programs/miniconda3/envs/compleasm/lib/python3.7/urllib/request.py", line 1350, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/cluster/projects/nn8013k/programs/miniconda3/envs/compleasm/lib/python3.7/http/client.py", line 1281, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/cluster/projects/nn8013k/programs/miniconda3/envs/compleasm/lib/python3.7/http/client.py", line 1327, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/cluster/projects/nn8013k/programs/miniconda3/envs/compleasm/lib/python3.7/http/client.py", line 1276, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/cluster/projects/nn8013k/programs/miniconda3/envs/compleasm/lib/python3.7/http/client.py", line 1036, in _send_output
    self.send(msg)
  File "/cluster/projects/nn8013k/programs/miniconda3/envs/compleasm/lib/python3.7/http/client.py", line 976, in send
    self.connect()
  File "/cluster/projects/nn8013k/programs/miniconda3/envs/compleasm/lib/python3.7/http/client.py", line 1443, in connect
    super().connect()
  File "/cluster/projects/nn8013k/programs/miniconda3/envs/compleasm/lib/python3.7/http/client.py", line 948, in connect
    (self.host,self.port), self.timeout, self.source_address)
  File "/cluster/projects/nn8013k/programs/miniconda3/envs/compleasm/lib/python3.7/socket.py", line 728, in create_connection
    raise err
  File "/cluster/projects/nn8013k/programs/miniconda3/envs/compleasm/lib/python3.7/socket.py", line 716, in create_connection
    sock.connect(sa)
TimeoutError: [Errno 110] Connection timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/cluster/projects/nn8013k/programs/miniconda3/envs/compleasm/bin/compleasm", line 10, in <module>
    sys.exit(main())
  File "/cluster/projects/nn8013k/programs/miniconda3/envs/compleasm/lib/python3.7/site-packages/compleasm.py", line 2534, in main
    args.func(args)
  File "/cluster/projects/nn8013k/programs/miniconda3/envs/compleasm/lib/python3.7/site-packages/compleasm.py", line 2425, in run
    mode=mode)
  File "/cluster/projects/nn8013k/programs/miniconda3/envs/compleasm/lib/python3.7/site-packages/compleasm.py", line 2092, in __init__
    self.downloader = Downloader(library_path)
  File "/cluster/projects/nn8013k/programs/miniconda3/envs/compleasm/lib/python3.7/site-packages/compleasm.py", line 85, in __init__
    self.lineage_description, self.placement_description = self.download_file_version_document()
  File "/cluster/projects/nn8013k/programs/miniconda3/envs/compleasm/lib/python3.7/site-packages/compleasm.py", line 127, in download_file_version_document
    urllib.request.urlretrieve(hash_url, hash_download_path)
  File "/cluster/projects/nn8013k/programs/miniconda3/envs/compleasm/lib/python3.7/urllib/request.py", line 247, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
  File "/cluster/projects/nn8013k/programs/miniconda3/envs/compleasm/lib/python3.7/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/cluster/projects/nn8013k/programs/miniconda3/envs/compleasm/lib/python3.7/urllib/request.py", line 525, in open
    response = self._open(req, data)
  File "/cluster/projects/nn8013k/programs/miniconda3/envs/compleasm/lib/python3.7/urllib/request.py", line 543, in _open
    '_open', req)
  File "/cluster/projects/nn8013k/programs/miniconda3/envs/compleasm/lib/python3.7/urllib/request.py", line 503, in _call_chain
    result = func(*args)
  File "/cluster/projects/nn8013k/programs/miniconda3/envs/compleasm/lib/python3.7/urllib/request.py", line 1393, in https_open
    context=self._context, check_hostname=self._check_hostname)
  File "/cluster/projects/nn8013k/programs/miniconda3/envs/compleasm/lib/python3.7/urllib/request.py", line 1352, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [Errno 110] Connection timed out>

The command is this:

compleasm run -a default_filt.hic.hap1.p_ctg.fa -l aves -L /cluster/projects/nn8013k/opt/busco_dbs/lineages/ -t 10 -o compleasm_test

When running on the login nodes, it looks like it actually downloads aves and eukaryota, even though the folders existed in /cluster/projects/nn8013k/opt/busco_dbs/lineages/.

Am I doing something wrong?

Our computing nodes do not have internet access, and I don't think it is nice practice to download something without letting the user know that it is happening. How can I turn this off? That is, can I download something before submitting the job to the cluster?

Thank you.

Ole

huangnengCSU commented 1 year ago

Hi @olekto, sorry for the late response. While running, compleasm downloads the lineage file that is specified. So if the worker nodes do not have internet access, you can use compleasm download to pre-download the lineage on the login nodes, then specify the directory of the downloaded files when running on the worker nodes.

Neng
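Before submitting a job, it can help to confirm whether a given node has outbound network access at all, since the first traceback above fails inside socket.create_connection with a timeout. The following is a minimal sketch, not part of compleasm; the function name can_reach is made up for illustration, and the host to test is an assumption you have to fill in yourself, because this thread does not say which server compleasm contacts.

```python
import socket


def can_reach(host, port=443, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds.

    Hypothetical helper for illustration; compleasm does not provide it.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # OSError covers timeouts, refused connections, and DNS lookup failures.
        return False
```

If this returns False on a compute node for the server you need, the lineage has to be fetched on a node with internet access first and passed in with -L.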

olekto commented 1 year ago

I pointed compleasm to where I had downloaded all the lineages for BUSCO. Does it require something different from what BUSCO uses? It still downloaded the lineage. Can I use the lineages downloaded by compleasm for BUSCO? I'd rather not have two sets of lineages lying around on the cluster.

What I did for BUSCO was to download everything, and then point to it. I guess I have to download each lineage independently in the case of compleasm?

When I got the lineage downloaded via compleasm, it ran successfully. So looking good so far. :)

huangnengCSU commented 1 year ago

@olekto The organization of the lineage file downloaded by compleasm is different from that of BUSCO. So directly specifying the lineage directory downloaded by BUSCO doesn't work. We will consider making compleasm compatible with the lineage files downloaded by BUSCO in future versions.

xiekunwhy commented 1 year ago

is there a way to download lineage files manually instead of using compleasm download?

arslan9732 commented 1 year ago

is there a way to download lineage files manually instead of using compleasm download?

I think you can download the data directly from https://busco-archive.ezlab.org/data/lineages/ and use it after unzipping.
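For the unpacking step, here is a rough sketch using Python's standard tarfile module. It assumes the downloaded lineage is a gzip-compressed tar archive, and the function name extract_lineage_archive is made up for illustration. Whether compleasm accepts the extracted folder as-is is not confirmed anywhere in this thread.

```python
import tarfile
from pathlib import Path


def extract_lineage_archive(archive_path, dest_dir):
    """Extract a .tar.gz archive into dest_dir and return the top-level entry names.

    Hypothetical helper; assumes the lineage was downloaded as a gzip-compressed tar file.
    """
    dest = Path(dest_dir).resolve()
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        # Refuse members whose paths would land outside dest_dir.
        for member in tar.getmembers():
            target = (dest / member.name).resolve()
            if target != dest and dest not in target.parents:
                raise ValueError(f"unsafe path in archive: {member.name}")
        tar.extractall(dest)
    return sorted(p.name for p in dest.iterdir())
```

The returned list makes it easy to see which folder name the archive produced before pointing -L at its parent directory.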

eskutkaan commented 1 year ago

@olekto The organization of the lineage file downloaded by compleasm is different from that of BUSCO. So directly specifying the lineage directory downloaded by BUSCO doesn't work. We will consider making compleasm compatible with the lineage files downloaded by BUSCO in future versions.

I am waiting on this feature, too. Enabling the program to run with the lineage files downloaded by BUSCO will be really useful for my case.

lilinzhou commented 8 months ago

Hi, are there any updates on this issue? The protein command works well in offline mode, but the run command does not work properly in offline mode.

@olekto The organization of the lineage file downloaded by compleasm is different from that of BUSCO. So directly specifying the lineage directory downloaded by BUSCO doesn't work. We will consider making compleasm compatible with the lineage files downloaded by BUSCO in future versions.

huangnengCSU commented 8 months ago

Hi @lilinzhou,

The latest version of compleasm, v0.2.5, has fixed the problem. The bug was caused by an update to the BUSCO-related file format a few months ago.

lilinzhou commented 8 months ago

Hi @huangnengCSU

The latest version still seems to request a download when using the run command. I found a new, empty file named "file_versions.tsv.tmp" in the BUSCO database folder. The command I used:

python3 /path/to/software/compleasm_kit/compleasm.py run -a genome.fasta -l eukaryota_odb10 -o out_genome -L /path/to/software/BUSCO/lineages

The error log is at the end of this comment. Everything works fine with the protein command, though. The command I used:

python3 /path/to/software/compleasm_kit/compleasm.py protein -a protein.fasta -l eukaryota_odb10 -o out_protein -L /path/to/software/BUSCO/lineages

Our nodes do not have internet access, so I can only download the BUSCO database manually. Could you help solve this problem?

Searching for miniprot in the path where compleasm.py is located
Searching for hmmsearch in the path where compleasm.py is located
miniprot execute command:
 /path/to/software/compleasm_kit/miniprot
Traceback (most recent call last):
  File "/path/to/software/Python-3.9.6/lib/python3.9/urllib/request.py", line 1346, in do_open
    h.request(req.get_method(), req.selector, req.data, headers,
  File "/path/to/software/Python-3.9.6/lib/python3.9/http/client.py", line 1257, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/path/to/software/Python-3.9.6/lib/python3.9/http/client.py", line 1303, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/path/to/software/Python-3.9.6/lib/python3.9/http/client.py", line 1252, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/path/to/software/Python-3.9.6/lib/python3.9/http/client.py", line 1012, in _send_output
    self.send(msg)
  File "/path/to/software/Python-3.9.6/lib/python3.9/http/client.py", line 952, in send
    self.connect()
  File "/path/to/software/Python-3.9.6/lib/python3.9/http/client.py", line 1426, in connect
    self.sock = self._context.wrap_socket(self.sock,
  File "/path/to/software/Python-3.9.6/lib/python3.9/ssl.py", line 500, in wrap_socket
    return self.sslsocket_class._create(
  File "/path/to/software/Python-3.9.6/lib/python3.9/ssl.py", line 1040, in _create
    self.do_handshake()
  File "/path/to/software/Python-3.9.6/lib/python3.9/ssl.py", line 1309, in do_handshake
    self._sslobj.do_handshake()
ssl.SSLEOFError: EOF occurred in violation of protocol (_ssl.c:1129)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/path/to/software/compleasm_kit/compleasm.py", line 2741, in <module>
    main()
  File "/path/to/software/compleasm_kit/compleasm.py", line 2737, in main
    args.func(args)
  File "/path/to/software/compleasm_kit/compleasm.py", line 2601, in run
    mr = CompleasmRunner(assembly_path=assembly_path,
  File "/path/to/software/compleasm_kit/compleasm.py", line 2114, in __init__
    self.downloader = Downloader(library_path)
  File "/path/to/software/compleasm_kit/compleasm.py", line 85, in __init__
    self.lineage_description, self.placement_description = self.download_file_version_document()
  File "/path/to/software/compleasm_kit/compleasm.py", line 127, in download_file_version_document
    urllib.request.urlretrieve(hash_url, hash_download_path)
  File "/path/to/software/Python-3.9.6/lib/python3.9/urllib/request.py", line 239, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
  File "/path/to/software/Python-3.9.6/lib/python3.9/urllib/request.py", line 214, in urlopen
    return opener.open(url, data, timeout)
  File "/path/to/software/Python-3.9.6/lib/python3.9/urllib/request.py", line 517, in open
    response = self._open(req, data)
  File "/path/to/software/Python-3.9.6/lib/python3.9/urllib/request.py", line 534, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
  File "/path/to/software/Python-3.9.6/lib/python3.9/urllib/request.py", line 494, in _call_chain
    result = func(*args)
  File "/path/to/software/Python-3.9.6/lib/python3.9/urllib/request.py", line 1389, in https_open
    return self.do_open(http.client.HTTPSConnection, req,
  File "/path/to/software/Python-3.9.6/lib/python3.9/urllib/request.py", line 1349, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error EOF occurred in violation of protocol (_ssl.c:1129)>

huangnengCSU commented 8 months ago

To @lilinzhou,

I guess your problem is caused by the network. During the first run, compleasm downloads some files. Because of the network problem, the download did not finish, which leaves tag files with names ending in .tmp in the download folder.
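To check whether a download folder contains such leftover tag files, something like the following could be run before starting a job. It is only a sketch: the function name find_incomplete_downloads is made up, and the only thing taken from this thread is that unfinished downloads leave files whose names end in .tmp (for example file_versions.tsv.tmp).

```python
from pathlib import Path


def find_incomplete_downloads(library_path):
    """Return the sorted paths of all files ending in '.tmp' under library_path.

    Hypothetical helper for illustration; compleasm does not provide it.
    """
    root = Path(library_path)
    # rglob also searches subfolders, in case tag files sit next to a lineage.
    return sorted(p for p in root.rglob("*.tmp") if p.is_file())
```

An empty result does not prove the folder is complete; it only shows that no interrupted download left a marker behind.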

If the problem is the network, you may have to download the lineage files using compleasm download on a computer that has access to the lineage database. Then you can upload the download folder from that computer to the work server. When running compleasm run, you can specify the download folder with the -L option.

This problem occurs only in compleasm run and not in compleasm protein because compleasm run includes a step that checks the lineage files. compleasm protein does not have this step (it should, but I have not implemented it yet).
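Both tracebacks in this thread end in urllib.request.urlretrieve, called from download_file_version_document. For illustration only, here is one way such a check could tolerate an offline node: try the download, and if the network fails, reuse a copy that is already on disk. This is not compleasm's actual code; fetch_or_reuse and its return values are hypothetical names.

```python
import os
import urllib.request


def fetch_or_reuse(url, dest_path):
    """Download url to dest_path, or fall back to an existing local copy.

    Returns "downloaded" or "reused". Re-raises the network error when
    no local copy exists. Hypothetical sketch, not compleasm's implementation.
    """
    tmp_path = dest_path + ".tmp"
    try:
        urllib.request.urlretrieve(url, tmp_path)
    except OSError:
        # urllib.error.URLError subclasses OSError, so timeouts and TLS
        # handshake failures like the ones in this thread end up here.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)  # do not leave a partial file behind
        if os.path.exists(dest_path):
            return "reused"
        raise
    # Only replace the old copy once the download has fully succeeded.
    os.replace(tmp_path, dest_path)
    return "downloaded"
```

Writing to a temporary name and renaming afterwards means an interrupted download never overwrites a good local file.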