ncbi / fcs-gx

Foreign Contamination Screening - GX source code

Errors during operation #2

Closed liangchengbo closed 4 months ago

liangchengbo commented 4 months ago

In the process of running run_gx.py, the following error is reported. Can you help me to solve it? thanks!!!

Fatal error: index.cpp:484 in from_stream(...): Unrecognized file content.
Warning: missing header '##[["GX hits",2,1]]'
Fatal error: taxify.cpp:350 in make_run_info_json(...): Assertion failed: agg_cvg <= 1
Error: Process failed with retcode 1: ['nice', '-n19', '/mnt/z/lcb/fcs-gx/fcs-gx-release/fcs-gx-release/gx', 'align', '--gx-db=/mnt/z/lcb/fcs-gx/fcs-gx-release/fcs-gx-release/database/gxdb/all.gxi', '--repeats-basis-fa=/dev/fd/6'])


Traceback (most recent call last):
  File "/mnt/z/lcb/fcs-gx/fcs-gx-release/fcs-gx-release/scripts/run_gx.py", line 1114, in <module>
    main()
  File "/mnt/z/lcb/fcs-gx/fcs-gx-release/fcs-gx-release/scripts/run_gx.py", line 1089, in main
    run_gx_pipeline(args)
  File "/mnt/z/lcb/fcs-gx/fcs-gx-release/fcs-gx-release/scripts/run_gx.py", line 732, in run_gx_pipeline
    with ProcessPipeline() as p_main:
  File "/mnt/z/lcb/fcs-gx/fcs-gx-release/fcs-gx-release/scripts/run_gx.py", line 312, in __exit__
    self.wait()
  File "/mnt/z/lcb/fcs-gx/fcs-gx-release/fcs-gx-release/scripts/run_gx.py", line 302, in wait
    assert num_errors == 0, "Had errors."
           ^^^^^^^^^^^^^^^
AssertionError: Had errors.

etvedte commented 4 months ago

Does this error occur when running the test FASTA file and test database?

Please post the full commands you used

liangchengbo commented 4 months ago

This error also occurs when running the test FASTA file and test database. Here is my command:

/mnt/z/lcb/fcs-gx/fcs-gx-release/fcs-gx-release/scripts/run_gx.py \
    --fasta /mnt/z/lcb/genoms/bharal/0300.purge/purged.fa \
    --tax-id 1204301 \
    --gx-db /mnt/z/lcb/fcs-gx/fcs-gx-release/fcs-gx-release/database/gxdb/all \
    --out-dir /mnt/z/lcb/genoms/bharal/0301.fcs-gx \
    --bin-dir /mnt/z/lcb/fcs-gx/fcs-gx-release/fcs-gx-release

etvedte commented 4 months ago

The message Fatal error: index.cpp:484 in from_stream(...): Unrecognized file content. indicates the gxdb path either does not contain the gx database or the content is corrupted.

What is the output of the following?

cd /mnt/z/lcb/fcs-gx/fcs-gx-release/fcs-gx-release/database/gxdb/
ls -l *

Post the files and file sizes of the ls command. It should look similar to:

-rw-rw-r-- 1 user group          187 Jan 24  2023 all.README.txt
-rw-rw-r-- 1 user group      8887448 Jan 24  2023 all.assemblies.tsv
-rw-rw-r-- 1 user group      8241107 Jan 24  2023 all.blast_div.tsv.gz
-rw-rw-r-- 1 user group 321216733352 Jan 24  2023 all.gxi
-rw-rw-r-- 1 user group 177317125807 Jan 24  2023 all.gxs
-rw-rw-r-- 1 user group         1652 Jan 31  2023 all.manifest
-rw-rw-r-- 1 user group           59 Jan 24  2023 all.meta.jsonl
-rw-rw-r-- 1 user group     22549956 Jan 24  2023 all.seq_info.tsv.gz
-rw-rw-r-- 1 user group      6385518 Jan 24  2023 all.taxa.tsv

Also, please verify the integrity of the database as follows (provide actual paths to the --dir and --mft args):

dist/sync_files.py check --dir=/mnt/z/lcb/fcs-gx/fcs-gx-release/fcs-gx-release/database/gxdb/ --mft=/mnt/z/lcb/fcs-gx/fcs-gx-release/fcs-gx-release/database/gxdb/all.manifest

The expected output is:

===============================================================================
/mnt/z/lcb/fcs-gx/fcs-gx-release/fcs-gx-release/database/gxdb/ is up-to-date with /mnt/z/lcb/fcs-gx/fcs-gx-release/fcs-gx-release/database/gxdb/

In the command you posted in your most recent comment, it appears you are using the 'all' database, not the test database:

--gx-db /mnt/z/lcb/fcs-gx/fcs-gx-release/fcs-gx-release/database/gxdb/all

It may be that you did try the test sets and are just copying the original command you used. Please verify the contents and integrity of the test database using the ls and sync_files.py commands, respectively.
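
A quick first pass before the full checksum verification is to compare on-disk file sizes with the reference listing. The sketch below is only an illustration: `find_size_mismatches` and `REFERENCE_SIZES` are hypothetical names, and the sizes are copied from the Jan 2023 `ls -l` listing in this comment, so a newer database release would presumably have different values. A size match does not prove integrity; sync_files.py check is still needed for that.

```python
from pathlib import Path

# Byte sizes copied from the reference `ls -l` listing in this thread.
# Assumption: these apply only to that particular 'all' database release.
REFERENCE_SIZES = {
    "all.assemblies.tsv": 8887448,
    "all.blast_div.tsv.gz": 8241107,
    "all.gxi": 321216733352,
    "all.gxs": 177317125807,
    "all.meta.jsonl": 59,
    "all.seq_info.tsv.gz": 22549956,
    "all.taxa.tsv": 6385518,
}


def find_size_mismatches(db_dir, expected=REFERENCE_SIZES):
    """Return {filename: (expected_size, actual_size)} for files that differ.

    actual_size is None when the file is missing from db_dir.
    """
    mismatches = {}
    for name, want in expected.items():
        path = Path(db_dir) / name
        have = path.stat().st_size if path.is_file() else None
        if have != want:
            mismatches[name] = (want, have)
    return mismatches


if __name__ == "__main__":
    import sys

    # Usage: python check_sizes.py /path/to/gxdb
    for name, (want, have) in find_size_mismatches(sys.argv[1]).items():
        print(f"{name}: expected {want} bytes, found {have}")
```

An empty result only means the sizes agree; a file can have the right size and still fail the checksum comparison.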

liangchengbo commented 4 months ago

I've verified the database. I ran my code with the integrated database, but the same error was still reported. The command given in the tutorial is "./dist/run_gx". When I run this command, the program fails to run and displays the "--help" content. So I ran "./scripts/run_gx.py" instead and got the aforementioned error. Is this the cause of my reported errors? How should I modify it?

liangchengbo commented 4 months ago

Should I extract the two .gz files from the database before running?

etvedte commented 4 months ago

I'm reopening this issue because you're seeing the same error message. I've copied your comments below, each followed by my response:

I've verified the database. I ran my code with the integrated database, but the same error was still reported.

Please paste the output of ls -l * in your GX database folder. I would like to see the contents and file sizes. Also, please state whether you used sync_files.py check to verify the database contents and whether you saw a similar message to my comment above.

The command given in the tutorial is "./dist/run_gx". When I run this command, the program fails to run and displays the "--help" content. So I ran "./scripts/run_gx.py" instead and got the aforementioned error. Is this the cause of my reported errors?

You should be able to run either dist/run_gx or scripts/run_gx.py so long as --bin-dir is set to the dist directory containing the proper executables. Can you run ls -l * in the folder you are setting as --bin-dir? Based on the message you are seeing, I don't think this is the source of the error, but I would like to check.

Should I extract the two .gz files from the database before running?

No, this is not needed.

liangchengbo commented 4 months ago

Here is the information about the database. I changed the database directory to /mnt/z/lcb/fcs-gx/gxdb/all/. If the error is due to the integrity of the database, can you provide an alternative way to download the database? I have tried several times to download the database using the script in fcs-gx. I also downloaded it by FTP from "ftp.ncbi.nlm.nih.gov" several times, but I have never been able to resolve the reported error.

(fcs-gx) lcb@zooeco-R282-Z93:/mnt/z/lcb/fcs-gx/gxdb/all$ ls -l *
-rwxrwxrwx 1 lcb lcb      8887448 7月 21 21:01 all.assemblies.tsv
-rwxrwxrwx 1 lcb lcb      8241107 7月 21 21:01 all.blast_div.tsv.gz
-rwxrwxrwx 1 lcb lcb 321216733352 7月 22 01:13 all.gxi
-rwxrwxrwx 1 lcb lcb 177389875999 7月 21 22:15 all.gxs
-rwxrwxrwx 1 lcb lcb         1652 7月 21 21:01 all.manifest
-rwxrwxrwx 1 lcb lcb           59 7月 21 21:01 all.meta.jsonl
-rwxrwxrwx 1 lcb lcb          192 7月 21 21:01 all.README.txt
-rwxrwxrwx 1 lcb lcb     22549956 7月 21 21:01 all.seq_info.tsv.gz
-rwxrwxrwx 1 lcb lcb      6385518 7月 21 21:01 all.taxa.tsv

(fcs-gx) lcb@zooeco-R282-Z93:/mnt/z/lcb/fcs-gx$ /mnt/z/lcb/fcs-gx/dist/sync_files check --dir=/mnt/z/lcb/fcs-gx/gxdb/all/ --mft=/mnt/z/lcb/fcs-gx/gxdb/all/all.manifest
===============================================================================
Source:      /mnt/z/lcb/fcs-gx/gxdb/all
Destination: /mnt/z/lcb/fcs-gx/gxdb/all
Space check: Available:11.06TiB; Existing:464.41GiB; Incoming:464.34GiB; Delta:-69.37MiB

Computing md5 hash of /mnt/z/lcb/fcs-gx/gxdb/all/all.meta.jsonl ... c2096cdb8106d44a310052b06a23836c
Skipping existing 59B all.meta.jsonl

/mnt/z/lcb/fcs-gx/gxdb/all/all.README.txt - file-size changed.
Requires transfer: 187B all.README.txt

Computing md5 hash of /mnt/z/lcb/fcs-gx/gxdb/all/all.taxa.tsv ... c94d1fc80f81dbbf30b114d4cdaf29ad
Skipping existing 6.09MiB all.taxa.tsv

Skipping existing 7.86MiB all.blast_div.tsv.gz

Skipping existing 8.48MiB all.assemblies.tsv

Computing md5 hash of /mnt/z/lcb/fcs-gx/gxdb/all/all.seq_info.tsv.gz ... 6a760eed5a94aaf46d4dd8c75f370875
Skipping existing 21.51MiB all.seq_info.tsv.gz

/mnt/z/lcb/fcs-gx/gxdb/all/all.gxs - file-size changed.
Requires transfer: 165.14GiB all.gxs

Computing md5 hash of /mnt/z/lcb/fcs-gx/gxdb/all/all.gxi ... 1b77edf28321975b3b436466fa161f7d
/mnt/z/lcb/fcs-gx/gxdb/all/all.gxi - checksum changed.
Requires transfer: 299.16GiB all.gxi

etvedte commented 4 months ago

The 'file-size changed' message for all.gxs and the 'checksum changed' message for all.gxi mean that your downloaded files are likely corrupted, which would explain the GX error.
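
For readers who want to spot-check a single file by hand, the checksum comparison can be sketched as below. This is a minimal illustration, not how sync_files.py is implemented: `md5_of_file` and `matches_expected` are hypothetical helpers, and the expected hash has to be supplied by the caller because the manifest format is not shown in this thread. The file is read in chunks so that a file of several hundred GiB, such as all.gxi, never has to fit in memory.

```python
import hashlib


def md5_of_file(path, chunk_size=8 * 1024 * 1024):
    """Return the hex MD5 digest of a file, reading it chunk by chunk."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"" (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_md5):
    """Compare a file's MD5 with an expected hex digest (case-insensitive)."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

A mismatch here points to the same conclusion as 'checksum changed': the file should be downloaded again.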

Please perform the following:

Download the test gx database:

sync_files.py get --mft=https://ftp.ncbi.nlm.nih.gov/genomes/TOOLS/FCS/database/test-only/test-only.manifest  --dir=/path/to/test_gxdb

=============================================================================== 
Source:      https://ftp.ncbi.nlm.nih.gov/genomes/TOOLS/FCS/database/test-only 
Destination: /path/to/test_gxdb 
Warning: aria2c is not accessible - will use curl instead (may be much slower). 
Space check: Available:3.07TiB; Existing:0B; Incoming:4.29GiB; Delta:4.29GiB
...
Removing /path/to/test_gxdb.lockfile. 
Done. 

Verify the test gx database:

sync_files.py check --dir=/path/to/test_gxdb --mft=/path/to/test_gxdb/test-only.manifest
=============================================================================== 
/path/to/test_gxdb is up-to-date with /path/to/test_gxdb.

Download an example FASTA and run:

curl -LO https://zenodo.org/records/10932013/files/FCS_combo_test.fa
run_gx.py --fasta=FCS_combo_test.fa --tax-id=4932 --gx-db=/path/to/test_gxdb --bin-dir=/path/to/fcs-gx/dist/
...
...
fcs_gx_report.txt action summary:
---------------------------------
                                seqs      bases
                               ----- ----------
TOTAL                              2       2000
-----                          ----- ----------
EXCLUDE                            2       2000

I just got this to work.

liangchengbo commented 4 months ago

When I download the database with the script, the following error is reported. Is this due to my environment configuration? I uploaded the database to my server after downloading it with ftp. Is my environment configuration something that will affect the integrity of my database?

  File "/mnt/z/lcb/fcs-gx/./dist/sync_files", line 725, in main
    transfer_file(mi, src_mft_dir, work_dir)
  File "/mnt/z/lcb/fcs-gx/./dist/sync_files", line 588, in transfer_file
    subprocess.run(["curl", "-L", "-C", "-", "--retry", "5", "-o", tmp_file_path, url], check=True)
  File "/home/lcb/miniconda3/envs/fcs-gx/lib/python3.12/subprocess.py", line 571, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['curl', '-L', '-C', '-', '--retry', '5', '-o', PosixPath('/mnt/z/lcb/fcs-gx/gxdb/all_py.in_progress/all.gxs.part'), 'https://ftp.ncbi.nlm.nih.gov/genomes/TOOLS/FCS/database/latest/all.gxs']' returned non-zero exit status 56.

Removing /mnt/z/lcb/fcs-gx/gxdb/all_py.lockfile.

etvedte commented 4 months ago

Return code 56 in curl indicates a "Failure in receiving network data". This error typically occurs when there's a problem with the connection or data transfer. Specifically:

It means that the transfer was interrupted or failed before it could be completed. This can happen due to various reasons, such as:

Therefore, an alternative to sync_files.py get is to retrieve the db files from FTP by another method, but you should still check the integrity of the db files using sync_files.py check.

In your most recent comment, the error message suggests you are trying to get the 'all' database, when I specifically recommended that you retrieve, verify and screen with the test-only database first. Please do that. The test-only database is small, so if that works then the failure is more likely a connection timeout issue than one of the other reasons mentioned above.

Additionally, can you run GX in a Docker or Singularity container, following the instructions on our wiki? In other words, is there a specific reason you are trying to run GX outside of a container? The container has a resumable database download mechanism.
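
If a container is not an option, one way to work around interrupted transfers is to re-run curl until it finishes, letting `-C -` resume from the partial file each time. The sketch below is an assumption-laden illustration rather than part of fcs-gx: `build_curl_command` and `download_with_resume` are hypothetical helpers, the curl flags are the ones visible in the traceback above, and the attempt count and wait time are arbitrary.

```python
import subprocess
import time

CURL_RECV_ERROR = 56  # curl exit status: "Failure in receiving network data"


def build_curl_command(url, dest):
    # Same flags as the curl call in the traceback above:
    # -L follows redirects, "-C -" resumes from an existing partial file.
    return ["curl", "-L", "-C", "-", "--retry", "5", "-o", str(dest), url]


def download_with_resume(url, dest, max_attempts=5, wait_seconds=30,
                         run=subprocess.run, sleep=time.sleep):
    """Re-run curl after exit status 56; return the number of attempts used.

    `run` and `sleep` are injectable so the retry logic can be exercised
    without network access.
    """
    for attempt in range(1, max_attempts + 1):
        result = run(build_curl_command(url, dest))
        if result.returncode == 0:
            return attempt
        if result.returncode != CURL_RECV_ERROR:
            raise RuntimeError(f"curl failed with exit status {result.returncode}")
        if attempt < max_attempts:
            sleep(wait_seconds)  # give the connection time to recover
    raise RuntimeError(f"download still failing after {max_attempts} attempts")
```

A completed download should still be verified with sync_files.py check, since a resumed transfer can finish successfully and still leave a file that does not match the manifest.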

liangchengbo commented 4 months ago

My error has been fixed. As you said, the error was caused by a damaged database. Thank you very much for your help! I wish you much success in your future endeavors.

etvedte commented 4 months ago

Glad you got it to work!