Hello,
Can you please paste examples of sequence identifiers that are causing issues with FCS-GX?
Sure, it was a really simple FASTA header:
>SUPER_23
And the pipeline runs when I swap it with the test data header:
>gi|1331740640|gb|JPZV02000012.1| Blattella germanica strain American Cyanamid = Orlando Normal contig_12, whole genome shotgun sequence
Here is the full URL 404 error I was getting:
Traceback (most recent call last):
File "/nfs/research/keane/user/tk2/202210-contam/LeuH2/Bazel.runfiles_lc5mktr2/runfiles/cgr_fcs/apps/private/retrieve_db/retrieve_db.py", line 330, in
Can you try a couple other things?
1) Can you look and report what the line endings are in your file? Are they the same in your original/modified file?
2) Can you re-run both cases, making sure you are using the same sets of parameters, and post the full command you used?
OK. I'll put my hand up here, my bad. The FASTA was produced by a sequence curation tool running on a Windows machine. Running dos2unix over the FASTA file has resolved the issue.
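For anyone who wants to check line endings without dos2unix, here is a minimal Python sketch. The function names are hypothetical helpers, not part of FCS-GX; it only counts CRLF versus bare LF line endings and writes a converted copy.

```python
from pathlib import Path


def count_line_endings(path):
    """Return (crlf, lf_only) counts for the file at `path`."""
    data = Path(path).read_bytes()
    crlf = data.count(b"\r\n")            # Windows-style line endings
    lf_only = data.count(b"\n") - crlf    # Unix-style line endings
    return crlf, lf_only


def to_unix_line_endings(src, dst):
    """Write a copy of `src` to `dst` with every CRLF replaced by LF."""
    data = Path(src).read_bytes()
    Path(dst).write_bytes(data.replace(b"\r\n", b"\n"))
```

A non-zero first count from count_line_endings("AreH2_curated.fa") would mean the file still contains Windows line endings.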
Apologies for re-opening. It appears the issue I was having was not due to Windows line endings. I can run the test data with no problem. However, when I change the --gx-db parameter to anything other than the test data setting (--gx-db "${SHM_LOC}/gxdb/test-only"), I get this URL error.
If I run my job with the same --gx-db parameter as the test but with my own FASTA file, then I get this warning, which I don't totally understand, but I'm guessing it still thinks that it should be looking for a bacteria sample:
Processed 942 queries, 2896.61Mbp in 78.531s. (36.8849Mbp/s) Source file /output-volume//AreH2_curated.38679.taxonomy.rpt.tmp
primary-divs: ['anml:rodents'] (0%) Top represented divs: prok:CFB group bacteria 3159372 bp
Aggregate coverage: 0%
Can you please provide some more information about your compute environment? What OS are you using, are you running on a VM, and are you running with the Docker or Singularity image?
Can you verify your Python version?
What method did you use to create a shared memory space?
Can you post your complete run_fcsgx.py command?
Sure. I'm running RockyLinux 8.5 using the Singularity image. Python is 3.5.9. I'm using a network disk (nfs) for the shared memory space. Here is the command I run:
export TMPDIR=$PWD; export SHM_LOC=$PWD/shm_loc; python3 ./run_fcsgx.py --fasta ./AreH2_curated.fa --out-dir ./gx_out/ --gx-db "${SHM_LOC}/gxdb/test-only" --gx-db-disk ./gxdb --split-fasta --tax-id 38679 --container-engine=singularity --image=fcsgx.sif
That is the command that you said worked? What exactly was it when it failed?
Yes, this works for both the test data and my own data (changing the --fasta option). If I modify the --gx-db to a different folder, then I get the URL error above. I realise this makes no sense, but that's what happens.
And when I say works for my own data, I mean it executes but I get the warning message above.
Using your own data with the test-only db I expected to see the result you reported above...i.e. the test-only set has only some prokaryote sequences in it and so GX wouldn't be able to assign these sequences as "rodent." The test-only db is meant to be used only with the test example FASTA provided in the wiki and shouldn't be used with your own data.
Did you set --gx-db to --gx-db "${SHM_LOC}/gxdb/all"? Did you initialize the directory with mkdir -p "${SHM_LOC}/gxdb"?
As mentioned in the above comment, please post the entire command for the run that failed with the URL error.
OK, solved! It wasn't clear that --gx-db must be set to --gx-db "${SHM_LOC}/gxdb/all". It is running happily now. This is probably an RTFM moment; it wasn't clear that the /all suffix was critical. Thanks.
Glad to hear that it is working! It is indeed on the wiki but we will see whether it needs to be more plainly clear for users.
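To summarize the fix discussed in this thread, here is a small Python sketch of a pre-flight check on the --gx-db value. The helper names are hypothetical and not part of run_fcsgx.py; the only facts assumed are the ones stated above: the value should be "${SHM_LOC}/gxdb/all" for real data, the gxdb directory is created with mkdir -p, and test-only is meant only for the wiki's test FASTA.

```python
import os


def resolve_gx_db(shm_loc, db_name="all"):
    """Build the --gx-db value as <shm_loc>/gxdb/<db_name>."""
    gxdb_dir = os.path.join(shm_loc, "gxdb")
    # Same effect as: mkdir -p "${SHM_LOC}/gxdb"
    os.makedirs(gxdb_dir, exist_ok=True)
    return os.path.join(gxdb_dir, db_name)


def check_gx_db(gx_db, using_test_fasta):
    """Return a warning string if test-only is paired with non-test data, else None."""
    name = os.path.basename(os.path.normpath(gx_db))
    if name == "test-only" and not using_test_fasta:
        return "test-only is meant only for the test FASTA from the wiki"
    return None
```

The string returned by resolve_gx_db(os.environ["SHM_LOC"]) is what would be passed as --gx-db when screening your own assembly.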
Hi - I'm running the fcsgx.py for contamination screening of some rodent long read assemblies. I get the test data to run successfully, all fine. When I go to run it on my own sequences, I get URL errors. I have isolated the problem down to the sequence identifiers - if I use the same sequence identifier as the test data, everything runs fine with the different taxonomy. Could you comment on whether the script is expecting a particular sequence identifier scheme? I couldn't see it noted anywhere in the docs.