ncbi / fcs

Foreign Contamination Screening caller scripts and documentation

Sequence identifiers #17

Closed: tk2 closed this issue 1 year ago

tk2 commented 1 year ago

Hi - I'm running run_fcsgx.py for contamination screening of some rodent long-read assemblies. The test data runs successfully, all fine. When I run it on my own sequences, I get URL errors. I have isolated the problem down to the sequence identifiers: if I use the same sequence identifier as the test data, everything runs fine with the different taxonomy. Could you comment on whether the script expects a particular sequence identifier scheme? I couldn't see it noted anywhere in the docs.

etvedte commented 1 year ago

Hello,

Can you please paste examples of sequence identifiers that are causing issues with FCS-GX?

tk2 commented 1 year ago

Sure, it was a really simple FASTA header:

>SUPER_23

And the pipeline runs when I swap it with the test data header:

>gi|1331740640|gb|JPZV02000012.1| Blattella germanica strain American Cyanamid = Orlando Normal contig_12, whole genome shotgun sequence

Here is the full URL 404 error I was getting:

Traceback (most recent call last):
  File "/nfs/research/keane/user/tk2/202210-contam/LeuH2/Bazel.runfiles_lc5mktr2/runfiles/cgr_fcs/apps/private/retrieve_db/retrieve_db.py", line 330, in <module>
    sys.exit(main())
  File "/nfs/research/keane/user/tk2/202210-contam/LeuH2/Bazel.runfiles_lc5mktr2/runfiles/cgr_fcs/apps/private/retrieve_db/retrieve_db.py", line 315, in main
    gx.run(
  File "/nfs/research/keane/user/tk2/202210-contam/LeuH2/Bazel.runfiles_lc5mktr2/runfiles/cgr_fcs/apps/private/retrieve_db/retrieve_db.py", line 265, in run
    self.check_gx_db(gx_db, disk_index_path, gx_index_src, gx_ftp_basename)
  File "/nfs/research/keane/user/tk2/202210-contam/LeuH2/Bazel.runfiles_lc5mktr2/runfiles/cgr_fcs/apps/private/retrieve_db/retrieve_db.py", line 138, in check_gx_db
    file_sizes = self.check_fs_space(gx_ftp_basename)
  File "/nfs/research/keane/user/tk2/202210-contam/LeuH2/Bazel.runfiles_lc5mktr2/runfiles/cgr_fcs/apps/private/retrieve_db/retrieve_db.py", line 70, in check_fs_space
    needed_size, file_sizes = self.fetch_manifest(gx_ftp_basename + self.gx_db_ftp_folder + self.gx_db_name + ".manifest")
  File "/nfs/research/keane/user/tk2/202210-contam/LeuH2/Bazel.runfiles_lc5mktr2/runfiles/cgr_fcs/apps/private/retrieve_db/retrieve_db.py", line 40, in fetch_manifest
    with urllib.request.urlopen(ftp_loc, context=ctx) as mr:
  File "/usr/lib/python3.9/urllib/request.py", line 214, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.9/urllib/request.py", line 523, in open
    response = meth(req, response)
  File "/usr/lib/python3.9/urllib/request.py", line 632, in http_response
    response = self.parent.error(
  File "/usr/lib/python3.9/urllib/request.py", line 561, in error
    return self._call_chain(*args)
  File "/usr/lib/python3.9/urllib/request.py", line 494, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.9/urllib/request.py", line 641, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found

Traceback (most recent call last):
  File "/nfs/research/keane/user/tk2/202210-contam/LeuH2/../run_fcsgx.py", line 309, in <module>
    sys.exit(main())
  File "/nfs/research/keane/user/tk2/202210-contam/LeuH2/../run_fcsgx.py", line 299, in main
    gx.run()
  File "/nfs/research/keane/user/tk2/202210-contam/LeuH2/../run_fcsgx.py", line 201, in run
    self.run_retrieve_db()
  File "/nfs/research/keane/user/tk2/202210-contam/LeuH2/../run_fcsgx.py", line 119, in run_retrieve_db
    self.safe_exec(retrieve_db_args)
  File "/nfs/research/keane/user/tk2/202210-contam/LeuH2/../run_fcsgx.py", line 45, in safe_exec
    subprocess.run(args, shell=False, check=True, text=True, stdout=sys.stdout, stderr=sys.stderr)
  File "/homes/tk2/miniconda3/lib/python3.9/subprocess.py", line 528, in run

etvedte commented 1 year ago

Can you try a couple other things?

1) Can you check and report what the line endings are in your file? Are they the same in the original and the modified file? (See the quick check sketched below.)

2) Can you re-run both cases, making sure you are using the same sets of parameters, and post the full command you used?
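
For the line-endings check, something like this should work (a minimal sketch; assembly.fa stands in for your FASTA file):

file assembly.fa
# e.g. "ASCII text, with CRLF line terminators" would indicate Windows-style endings
grep -c $'\r' assembly.fa
# counts lines containing a carriage return; 0 means plain Unix (LF) endings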

tk2 commented 1 year ago

OK, I'll put my hand up here, my bad. The FASTA was produced via a sequence curation tool running on a Windows machine. Running dos2unix over the FASTA file has resolved the issue.
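
For reference, the fix was just the following (the sed form is an equivalent in case dos2unix isn't installed; AreH2_curated.fa is my assembly):

dos2unix AreH2_curated.fa
# or strip trailing carriage returns in place:
sed -i 's/\r$//' AreH2_curated.fa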

tk2 commented 1 year ago

Apologies for re-opening. It appears the issue I was having was not due to Windows line endings after all. I can run the test data with no problem; however, when I set the --gx-db parameter to anything other than the test data setting (--gx-db "${SHM_LOC}/gxdb/test-only"), I get the URL error above.

If I run my job with the same --gx-db parameter as the test but with my own FASTA file, I get this warning, which I don't totally understand, but I'm guessing it still thinks it should be looking for a bacterial sample:

Processed 942 queries, 2896.61Mbp in 78.531s. (36.8849Mbp/s)
Source file /output-volume//AreH2_curated.38679.taxonomy.rpt.tmp
primary-divs: ['anml:rodents'] (0%)
Top represented divs: prok:CFB group bacteria 3159372 bp
Aggregate coverage: 0%

etvedte commented 1 year ago

Can you please provide some more information about your compute environment: what OS are you using, are you running on a VM, and are you running with the Docker or Singularity image?

Can you verify your Python version?

What method did you use to create a shared memory space?

Can you post your complete run_fcsgx.py command?
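
For context on the shared memory question: GXDB is intended to sit on a RAM-backed filesystem; a typical setup might look like this (a sketch, assuming /dev/shm is your tmpfs mount):

export SHM_LOC=/dev/shm
mkdir -p "${SHM_LOC}/gxdb"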

tk2 commented 1 year ago

Sure. I'm running RockyLinux 8.5 using the Singularity image. Python is 3.9.5. I'm using a network disk (NFS) for the shared memory space. Here is the command I run:

export TMPDIR=$PWD; export SHM_LOC=$PWD/shm_loc; python3 ./run_fcsgx.py --fasta ./AreH2_curated.fa --out-dir ./gx_out/ --gx-db "${SHM_LOC}/gxdb/test-only" --gx-db-disk ./gxdb --split-fasta --tax-id 38679 --container-engine=singularity --image=fcsgx.sif

boukn commented 1 year ago

That is the command that you said worked? What exactly was it when it failed?

tk2 commented 1 year ago

Yes, this works for both the test data and my own data (changing the --fasta option). If I modify --gx-db to point to a different folder, I get the URL error above. I realise this makes no sense, but that's what happens.

tk2 commented 1 year ago

And when I say it works for my own data, I mean it executes but I get the warning message above.

etvedte commented 1 year ago

Using your own data with the test-only db, I would expect to see the result you reported above; i.e., the test-only set contains only some prokaryote sequences, so GX wouldn't be able to assign your sequences as "rodent". The test-only db is meant to be used only with the test example FASTA provided in the wiki and shouldn't be used with your own data.

Did you set --gx-db to "${SHM_LOC}/gxdb/all"? Did you initialize the directory with mkdir -p "${SHM_LOC}/gxdb"?
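
In other words, something like this (a sketch adapting the command you posted; "all" is the name of the full GXDB):

mkdir -p "${SHM_LOC}/gxdb"
python3 ./run_fcsgx.py --fasta ./AreH2_curated.fa --out-dir ./gx_out/ --gx-db "${SHM_LOC}/gxdb/all" --gx-db-disk ./gxdb --split-fasta --tax-id 38679 --container-engine=singularity --image=fcsgx.sif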

etvedte commented 1 year ago

As mentioned in the comment above, please post the entire command for the run that failed with the URL error.

tk2 commented 1 year ago

OK, solved! It wasn't clear to me that --gx-db must be set to "${SHM_LOC}/gxdb/all". It is running happily now. This is probably an RTFM moment, but it wasn't clear that the /all suffix was critical. Thanks.

etvedte commented 1 year ago

Glad to hear that it is working! It is indeed on the wiki, but we will see whether it needs to be made clearer for users.