hsinnan75 / StrainPro

MIT License
7 stars 3 forks source link

What ist reference-fna #6

Open termithorbor opened 4 years ago

termithorbor commented 4 years ago

Hi all, what os reference-fna in this case?

$bin/StrainPro-build -r reference-fna -o ref_idx [ref_idx is the output folder for BWT indexes]

Thanks for your help :)

hsinnan75 commented 4 years ago

After you run download_taxonomy.sh to download a group of genome sequences, you will get a file called "library.fna" in the folder of "database/$library_name", where $library could be : archaea, bacteria, viral, fungi and human. The fna file is the input file for StrainPro-build. Or you can compile your own database with the following format:

taxid|562|genome1 ATCCCGGCCCCGGCAGAACCGACCTATCGTTCTAACGTAAACGT.... taxid|562|genome2 ATCCCGGCCCCGGCCCCCAAGACCTATCGTTCTAACGTAAACGT....

termithorbor commented 4 years ago

Okay but when I try to download the bacteria genomes I get this error:

rsync_from_ncbi.pl: unexpected FTP path (new server?) for na

I am using your package from a ubuntu server at work.

hsinnan75 commented 4 years ago

I tried the script again on a ubuntu server and it worked fine. I'm not sure what the issue you encountered. Is there any more messages when you ran the script?

termithorbor commented 4 years ago

This is all I get:

./download_genomic_library.sh bacteria rsync_from_ncbi.pl: unexpected FTP path (new server?) for na

The sript is creating the folders and there is the assembly_report.txt in the bacteria folder but nothing else.

Interesting note: For human, fungi and viral it worked. It is not working for bacteria somehow...

hsinnan75 commented 4 years ago

If you download all complete genomes from RefSeq directly, the taxonomic information will not be included in the files. The script was modified from kraken and I found a solution in this thread. I updated the script, or you can modify rsync_from_ncbi.pl as follows:

[add] if ( $full_path =~/^na/) { next }

image

termithorbor commented 4 years ago

Thank you very much. Now it seems to work but step 1 takes quite long - is that normal? And can I add my own genomes (from strains not public yet) to the fna file to have them in the analysis as well? Should work right?

hsinnan75 commented 4 years ago

Sure, you can do that, and you have to follow the input format to add taxid for those genomes. If you don't know their taxids, you need to create a pseudo-taxid for each genome and add those taxids to taxonomy/nodes.dmp and taxonomy/names.dmp, otherwise StrainPro will not be able to recognize those genomes. When you create pseudo taxids, please make sure you also specify their paraent taxids to known ones.

termithorbor commented 4 years ago

Thanks angain. Two final questions: 1) is it normal that step 1 of ./download_genomic_library.sh bacteria takes quite long? 2) How are strains which are in the sample but not in the db assigned?

hsinnan75 commented 4 years ago
  1. Yes, it would take a long time to download bacteria genomes since the database is more than 50GB.
  2. If your sample contains novel strains, StrainPro will identify their parent taxids. For example, if your sample contains a novel E.coli strain, StrainPro is not able to identify this strain since there is no such annotation. StrainPro will use the known sequence fragments to assign it the taxid of E.coli.

Please note, if your sequencing data is not deep enough, when running StrainPro-map, use "-depth" and "-freq" to set appropriate values for strain identification. This tool is still under development, if you have any suggestions, please let us know. Thank you!

termithorbor commented 4 years ago

Okay but then you only see 1 Ecoli for probably 2 or more strains if they are unknown Ecoli? And does your program recognize forward and reverse reads into account? Or only forward or only reverse at a time?

hsinnan75 commented 4 years ago

If the unknown strain is highly similar to a known strain S, StrainPro will probably identify S instead, otherwise the best result is its species level identification. We consider both forward and reverse reads at the same time when performing read mapping, similar to regular read mapping methods.

termithorbor commented 4 years ago

So here I provide the forward and reverse reads seperated by space and they should have the same name but one has F and the other R in it or how do I specify which is the forward and which is the reverse one? Thanks again for your answers :)

$bin/StrainPro-map -i idx -f read -o profile.txt

hsinnan75 commented 4 years ago

forward and reverse reads should be separated into different read entries. If your sequencing data is paired-end, there should be two separated fastq / fasta files, one for each end. It does not matter which one is forward or reverse, but they must be consistent. They are not supposed to be mixed together.

termithorbor commented 4 years ago

Thanks again. My index folder contains those files after indexing. Is it normal that there is only one index with content?

image

and when I give the path to the folder like this: bin/StrainPro-map -i /folder/folder/StrainPro/ref_idx_bac -f /folder/folder/folder/straintestF.fq /folder/folder/folder/straintestR.fq -o /folder/folder/StrainPro/test_profile.txt

I get the followig error:

Error! Please specify a valid reference index or direcotry with indices!

bin/StrainPro-map v0.9.8 Usage: bin/StrainPro-map -i Index_Prefix -f <ReadFile_1 ReadFile_2 ...> -o OutputPrefix

Options: IndexPrefix can be either an index prefix or a directory of multiple indexes -t INT number of threads [16] -n INT maximal mismatches in an alignment [3] -depth INT minimal read depth of sequenced data [20] -freq INT minimal read count for identified genomes [50] -dump STRING dump file path

What am I doing wrong there?

hsinnan75 commented 4 years ago

It looks like there is something wrong while you were making reference database. It's supposed to have multiple clusters with different taxonomic levels. In my cases, I generated 28 clusters for bacteria genomes and each of them has different size ranging from 1G to 4G. Could you please show me how you ran StrainPro-build, thank you.

termithorbor commented 4 years ago

It seems to be a server harddrive problem. We are working on that already.

termithorbor commented 4 years ago

Now it seems to work but my bacteria db is only 27 GB (library.fna) and not 50 as you said and I only get 17 different indexes and not 28 - is that still right?

I found out that was also a storage problem when downloading it. Downloaded it again and now it has library.fna is ~81.94 GB - is that right? And is there a way to tell the program to which location to download the database?

hsinnan75 commented 4 years ago

It looks like the database is growing fast. 50G was around 18 months ago. The script downloads databases from ftp.ncbi.nlm.nih.gov. Thus the script was written based on the structure of the databases and on the file format. If you'd like to download from another source, the script would not work on the data source. However, you could compile your database manually if the file format meets our requirement.

termithorbor commented 4 years ago

Hi,

I did run the whole program now as a test with only one Ecoli strains in my sample but this is the result I get - what is going on there?

Rank TaxID Read_count Depth Abundance Confidence subspecies 316435 81 81 14.86 0.801980 subspecies 440524 101 20 3.67 0.205703 subspecies 526982 249 32 5.87 0.325065 subspecies 526983 474 34 6.24 0.345985 subspecies 526986 650 37 6.79 0.366610 subspecies 526987 181 44 8.07 0.441463 subspecies 526991 97 34 6.24 0.343972 subspecies 526992 231 29 5.32 0.290201 subspecies 526994 85 30 5.50 0.304659 subspecies 527028 296 32 5.87 0.323144 subspecies 527029 60 28 5.14 0.285714 subspecies 527030 189 29 5.32 0.288991 subspecies 527032 188 28 5.14 0.284418 subspecies 579405 55 50 9.17 0.504587 subspecies 1242101 86 37 6.79 0.375546 species 197 319 20 0.02 0.204225 species 545 92 9 0.01 0.092000 species 562 1422066 50 96.48 0.500311 species 564 671 34 0.05 0.346055 species 573 1715 51 0.12 0.512859 species 621 239 24 0.02 0.239000 species 622 4778 30 0.32 0.300806 species 623 33770 31 2.29 0.315634 species 624 735 30 0.05 0.300983 species 630 57 5 0.00 0.057000 species 1396 1967 35 0.13 0.346547 species 1405 76 7 0.01 0.076000 species 1428 733 30 0.05 0.300287 species 28901 1361 40 0.09 0.399354 species 64104 199 20 0.01 0.199000 species 69223 55 5 0.00 0.055000 species 208962 4467 31 0.30 0.314223 species 1813821 208 21 0.01 0.208000 species 2686305 128 12 0.01 0.128000 species 2725417 126 12 0.01 0.126000 species 2725997 114 11 0.01 0.114000 genus 194 319 16 0.02 0.319000 genus 544 92 4 0.01 0.092000 genus 561 1427783 50 96.07 0.998488 genus 570 1715 51 0.12 1.000000 genus 590 1615 37 0.11 0.739638 genus 620 51026 32 3.43 0.637646 genus 629 57 2 0.00 0.057000 genus 1386 2975 33 0.20 0.663988 genus 204037 55 2 0.00 0.055000 genus 447792 540 27 0.04 0.540000 family 543 1525530 48 99.77 1.000000 family 72294 319 10 0.02 0.319000 family 186817 2975 33 0.19 0.995983 family 1903410 55 1 0.00 0.055000 family 1903411 168 5 0.01 0.168000 order 1385 2975 33 0.19 1.000000 order 91347 1526322 48 99.78 1.000000 order 213849 319 8 0.02 0.319000 class 1236 1526322 48 99.78 1.000000 class 29547 319 6 0.02 0.319000 class 91061 2975 33 0.19 1.000000 phylum 1224 1526641 48 99.81 1.000000 phylum 1239 2975 33 0.19 1.000000 kingdom 2 1529616 48 100.00 1.000000

hsinnan75 commented 4 years ago

Thank you for reporting this issue. Have you tried running our test sample? (run_test.sh). There is also a single E.coli strain in the test sample. Could you paste the output? And is it possible that you share me a link to download your test data, I'll find out the problems. Thank you!

termithorbor commented 4 years ago

I run my own test sample where I basically run a whole geome shotgun sequencing of a signle ecoli strain which we have sequenced.

And I fear I am sadly not allowed to share the data.

I also did run run_test.sh before and everything worked fine.

hsinnan75 commented 4 years ago

Could you please map your short read data with an e.coli sequence as the reference and observe the mapping sensitivity. I was wondering how many of the short reads will be aligned to the reference sequence. Thank you!

termithorbor commented 4 years ago

Do you have any preferences for a mapping tool?

Sidenote: the StrainSeeker tool identifies my testsample as 100 % Ecoli unknown strain.

hsinnan75 commented 4 years ago

Based on the output, it is estimated 96.48% (abundance) is E.coli, but it is very wired that some of reads were assigned to other species. I randomly generated four different test sets using whole bacteria database and the precision/recall at species level was as high as 100%. We also tested on some real datasets and the accuracy was also the highest among the selected existing tools. I will generate another dataset based on your output to test StrainPro again. I'll let you know if I have the result. Thank you!

hsinnan75 commented 4 years ago

Could you run StrainPro again with the following arguments: -n 1 -depth 30 -freq 100 and let me know the outoput? thank you!