gjeunen / reference_database_creator

creating reference databases for amplicon sequencing
MIT License
28 stars 8 forks source link

db_download --species argument #64

Closed Guillermouceda closed 1 month ago

Guillermouceda commented 4 months ago

Hello,

I am trying to download sequences from a set of species with the following code:

crabs db_download --source ncbi --database nucleotide --query 'ITS2[All Fields] AND its2[All Fields] AND "Internal Transcribed Spacer 2"[All Fields] AND ITS-2[All Fields] AND ("1"[SLEN] : "1000"[SLEN])' --species "Anacamptis papilionacea + Anacamptis coriophora + Lysimachia foemina + Anthemis rigida + Asphodelus ramosus + Bifora testiculata + Centaurium maritimum + Cistus creticus + Geranium purpureum + Hypochaeris glabra + Lagoecia cuminoides + Lavandula stoechas + Linum bienne + Lotus subbiflorus + Moraea sisyrinchium + Neotinea lactea + Ornithogalum montanum + Ornithogalum pyrenaicum + Ornithopus compressus + Pyrus spinosa + Ranunculus paludosus + Scorzonera sublanata + Senecio vulgaris + Serapias cordigera + Trifolium arvense + Trifolium campestre + Tuberaria guttata" --output ITS2_ref_lib_part1_ncbi_1_50000.fasta --keep_original yes --email guillermoug1203@gmail.com --batchsize 50000

However, I get an error when running the code as if the argument --species does not exist:

crabs: error: unrecognized arguments: --species Anacamptis papilionacea + Anacamptis coriophora + Lysimachia foemina + Anthemis rigida + Asphodelus ramosus + Bifora testiculata + Centaurium maritimum + Cistus creticus + Geranium purpureum + Hypochaeris glabra + Lagoecia cuminoides + Lavandula stoechas + Linum bienne + Lotus subbiflorus + Moraea sisyrinchium + Neotinea lactea + Ornithogalum montanum + Ornithogalum pyrenaicum + Ornithopus compressus + Pyrus spinosa + Ranunculus paludosus + Scorzonera sublanata + Senecio vulgaris + Serapias cordigera + Trifolium arvense + Trifolium campestre + Tuberaria guttata

When I check the the help message with: crabs db_download -h , indeed the argument does not exist for this module.

How is this possible? I followed the tutorial given in this githubpage. Like this one:

crabs db_download --source ncbi --database nucleotide --query '16S[All Fields] AND ("1"[SLEN] : "50000"[SLEN])' --species 'Carcharinus leucas + Carcharinus acronotus' --output 16S_carcharinus_leucas_acronotus_ncbi_1_50000.fasta --keep_original yes --email johndoe@gmail.com --batchsize 5000

gjeunen commented 4 months ago

Hello @Guillermouceda,

Thank you for using CRABS. It seems your code is correct and when copy-pasting your code it runs fine for me. One thing though, I wouldn't recommend setting the batch size so high. Anything over 5,000 and the NCBI servers will most likely close the connection.

For your error, can you let me know the version of CRABS you are working with? You can see this by entering crabs --version. The latest version is crabs v0.1.8. The function of specifying species was only added in a later version per request of another user. Hence, if you're using an outdated version, the CRABS code might not have the ability for you to specify the --species parameter, which would result in the error you see.

Best wishes, Gert-Jan

Guillermouceda commented 4 months ago

Hello @gjeunen,

Thank you for answer. Indeed it was a problem of the version. Now, I have installed CRABS with Docker and it works. How ever, I have another problem. When I run the command eveything seems to work fine but I can't find the output file anywhere in my computer. See a screenshot of my terminal:

image

This is the command to run it from docker:

docker run crabs crabs db_download --source ncbi --database nucleotide --query 'ITS2[All Fields] AND its2[All Fields] AND "Internal Transcribed Spacer 2"[All Fields] AND ITS-2[All Fields] AND ("1"[SLEN] : "1000"[SLEN])' --species "Anacamptis papilionacea + Anacamptis coriophora + Lysimachia foemina + Anthemis rigida + Asphodelus ramosus + Bifora testiculata + Centaurium maritimum + Cistus creticus + Geranium purpureum + Hypochaeris glabra + Lagoecia cuminoides + Lavandula stoechas + Linum bienne + Lotus subbiflorus + Moraea sisyrinchium + Neotinea lactea + Ornithogalum montanum + Ornithogalum pyrenaicum + Ornithopus compressus + Pyrus spinosa + Ranunculus paludosus + Scorzonera sublanata + Senecio vulgaris + Serapias cordigera + Trifolium arvense + Trifolium campestre + Tuberaria guttata" --output ITS2_ref_lib_part1_ncbi_1_50000.fasta --keep_original yes --email guillermoug1203@gmail.com --batchsize 1000

This means that there has been an error. or What's the problem? I have tried to search for the file with find command with no output. Also i would like to know how the.txt of the species list should look like as I could not run the command with the file. Mine looks like this.

image

DO they have to have a certain symbol at the beginning of the name?

Thank you for your reply.

hughcross commented 4 months ago

Hi @Guillermouceda, I think I see the problem. With Docker you have to specify what folders to connect with on your computer. A Docker container is running a mini-operating system separate from your computer, so if you don't tell it how to connect to your computer it will run the command but the outputs will be inside the Docker container and will disappear when the command is finished.

There are some examples on our tutorial, but here is a quick description. Here I will put just the Docker part of the command:

docker run --rm -it \
  -v $(pwd):/data \
  --workdir="/data" \
  quay.io/swordfish/crabs:0.1.7 \
  crabs db_download \
  --source ncbi \
  (rest of crabs command...)

For your issue, the critical arguments are -v and --workdir. The -v tells you what directory on your computer to connect to which directory on the docker container. In the example above, the current directory ($(pwd)) will move into the /data folder inside the docker container; the --workdir is the working directory in the docker container, which in this example is /data. By adding the --workdir (or also -w), all outputs from the crabs command will end up in whatever directory you are running the command in.

There are additional examples on our tutorial. I would suggest also including the --rm, which will remove the container after the command is finished. I don't know the details, but without including this you could have a bunch of active containers taking up memory on your computer, and they are difficult to track down. They could be taking up memory without you knowing about. Check the tutorial for explanations of the other commands. Obviously, there are a lot of different options for running docker, but we tried to keep it simple so you can get your Crabs work done.

Regarding your second question, @gjeunen may have a quick answer. maybe we should put an example text file on this repo.

Let me know if this is still not working.

Cheers,

hugh

Guillermouceda commented 3 months ago

Hi @hughcross ,

Thank you for your answer. I will try to implement some changes in the code to include what you suggested. I am new to Docker and I did not know. Regarding my second question I will wait @gjeunen answer

Thank a lot again

gjeunen commented 3 months ago

Hello @Guillermouceda,

Apologies, but what is your second question?

Thanks, Gert-Jan

Guillermouceda commented 3 months ago

Hello @gjeunen,

My question is related to how the .txt file with the species list should look like. I have tried to run the code the including the flag for .txt file, however I got an error message. Mine species.list file look like this: image

Is it well formatted ? after @hughcross answer, I am thinking that the error might be due to this "With Docker you have to specify what folders to connect with on your computer". Am I correct? As i did not specify any working directory the command did not find the file

gjeunen commented 3 months ago

That formatting looks correct. Could you please let me know if you have the same error after specifying the folder of the file with Docker?

Guillermouceda commented 3 months ago

Hi @gjeunen after specifying the folder everything went fine. Now I am having troubles with the pga command. it looks rather stupid problem that I cannot solved. This is my code:

    docker run --rm -it \
        -v $(pwd):/data \
         --workdir="/data" \
        crabs \
        crabs pga \
         --input ITS2_ref_lib_ncbi.fasta \ 
         --output pga.fasta \
         --database TS2_ref_lib_seq_extracted.fasta \
         --fwd ATGCGATACTTGGTGTGAAT \
         --rev GACGCTTCTCCAGACTACAAT \
         --speed medium \
         --percid 0.95 \
         --coverage 0.95 \
         --filter_method relaxed

I get a stupid error: image

What's causing this error?

Thank a lot for the help

gjeunen commented 3 months ago

Hello @Guillermouceda,

@hughcross, could you check this, as I think this is Docker related? Can't recreate this issue using the GitHub or conda versions.

Cheers, Gert-Jan

Guillermouceda commented 3 months ago

Hello @gjeunen ,

I found the issue. There was an extra space after one of the lines that was stopping the code. However, I have one question. After running the in_silico PCR and the pga only 4 sequences out of c. 400 survived. Could there be a problem with my NCBI query?

I am interested in downloading all the ITS-2 sequences for the follwoing set of species:

species_list_complete.txt

My query to NCBI was:

'ITS2[All Fields] AND its2[All Fields] AND "Internal Transcribed Spacer 2"[All Fields] AND ITS-2[All Fields] AND ("1"[SLEN] : "1000"[SLEN])'

I do not know if there is a way to refine the query.

Does this means that the sequences I downloaded from NCBI were sequenced by other primer sets (i.e., different region in the same gene) and therefore CRABS is not able to retrieve them? In other words, the region amplified by my primers is out of the region contained in the NCBI sequences? I guess if this is the issue it can only be solved by increasing the sequencing effort with our primers, right?

Do you think that querying other database (e.g., BOLD) for the same set of species will help?

We are a bit short in budget and therefore trying to get most of the sequences from the databases is essential for us.

I run the pga as stated in this quote from another issue:

The crabs pga function uses the output file from the insilico_pcr function as input file and the input file from the insilico_pcr function as a database (initially downloaded sequence file). CRABS will automatically subset the data from the raw sequence file (database) to remove the sequences for which the amplicon was already retrieved (input). CRABS will then run a pairwise global alignment against all sequences with the input file as seed sequences in the aim to retrieve amplicons for which the primer-binding regions are missing. The output file from the crabs pga function will contain the sequences retrieved during the insilico_pcr step, as well as the newly retrieved amplicons during the pga step. There is no need to merge the output of crabs insilico_pcr and crabs pga.

Thanks a lot for the help :-).

gjeunen commented 3 months ago

Hello @Guillermouceda,

It seems that your NCBI query is correct. When looking at NCBI, it returns 482,805 sequences without subsetting it for your species list. When running the db_download code, I am downloading 342 sequences. However, I fail to retrieve any sequences during the insilico_pcr step. Are these the correct primers: (-f ATGCGATACTTGGTGTGAAT -r GACGCTTCTCCAGACTACAAT)?

Thanks, Gert-Jan

Guillermouceda commented 3 months ago

Hello @gjeunen,

The primer sequences are correct. they are actually from a publication in which they test their validity:

https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0008613#s5

In the supplementary information you have the sequences, which I have copy pasted from the file provided:

Thank for you help.

gjeunen commented 3 months ago

Hello @Guillermouceda,

I managed to download the 342 sequences for your species. Next, I managed to locate the forward primer in 316 sequences. However, I cannot find the reverse primer in the remainder of the amplicon for any of the 316 remaining sequences. The sequences are currently around 300 to 350 bp long. At first thought, this could potentially be explained by 3 scenarios:

  1. The reverse primer was used for barcoding and, hence, has been removed by the depositor as it is artificial. To check this, you might want to look into the publication of the depositing sequence to see if they have listed the barcoding primers that were used or contact the scientist who deposited the barcode sequence.
  2. The reverse primer does not work on your species of interest and you will need to design a new reverse primer that is compatible with your taxonomic group of interest.
  3. There is an unfortunate mistake in the sequence of the reverse primer that was published in the paper you provided.

My apologies that I can't be of more help to solve your issue.

Best wishes, Gert-Jan