ohlab / SMEG

Strain-level Metagenomic Estimation of Growth rate (SMEG) measures growth rates of microbial strains from complex metagenomic dataset
17 stars 6 forks source link

Error while running example test #5

Closed ShriramHPatel closed 4 years ago

ShriramHPatel commented 4 years ago

Hi SMEG dev,

I really appreciate your effort making this tool open source for community. And I really enjoyed reading SMEG paper.

I was trying to tun Example test which involves 12 strains of F. prausnitzii and I end up with following error. However, even after this error terminal does not halt, instead it is running forever.

And just to know, I have installed SMEG as suggested via Conda installation.

Appreciate your help in this.

**####### Code

(SMEG) $ smeg build_species -g . -o test_database -a -p 16 ################ Checking for dependencies ######## gcc found parallel found R found Mauve found roary found prokka found bowtie2 found samtools found bamtools found bedtools found blastn found All required packages found ################ Checking for required R libraries ######## Output directory ok R libraries ok build_species option activated /data/SMEG/SMEG-1.1.1/test is present directory /data/SMEG/SMEG-1.1.1/test is the genomes directory /data/SMEG/SMEG-1.1.1/test/test_database is the output directory ####### PRE PROCESSING GENOMES #####

Complete reference genome(s) detected ####### RE-ORDERING CONTIGS OF DRAFT GENOME(S) (MAUVE)#####

ls: cannot access alignment*/A2-165.fna.fas: No such file or directory**

Additionally I think there is a typo at line 40 build_sp script

num_of_contigs=$(grep -c ">" $f) ## generated error message from line 65 num_of_contigs=$(grep -c ">" $GDR/$f) ### changed to this and it resolved.

Thanks in advance, Shriram

aemiol commented 4 years ago

Hi Shriram, thanks for your message. The error is strange especially given most people successfully run the "example test". Would it be possible for you to run SMEG via singularity? What OS do you have?

Cheers, Tunde

ShriramHPatel commented 4 years ago

Thanks for your prompt reply,

I can take a look into using SMEG via singularity. And my OS is linux- Ubuntu 18.04.

But do you think this has to do something with the "rename" package,? Because being non-root user on server, I compiled and installed "rename" package manually (and not sudo apt-get).

I will update you soon. Thanks again,

ShriramHPatel commented 4 years ago

Heya,

I just figures out that the error I kept getting was due to "rename" package. And after getting it properly installed via root, example tests worked out well for me.

But can I hijack this thread to ask different question related to reference based SMEG measurements adapted on two different strains of S. epidermidis in the paper? I can open a new issue as well if you want.

Cheers, Shriram

aemiol commented 4 years ago

Sure, you can ask

ShriramHPatel commented 4 years ago

In paper where replication rate of two different strains of S. epidermidis in response to erythromycin was measured following in-vitro experiment, it is mentioned that reference based approach was used.

So, in that case, whether (a) SMEG database was constructed with (S. epidermidis) genomes downloaded from NCBI along with these two strains and then adapted reference based approach providing genome list as -g arguments (?) or (b) SMEG database was constructed with these two strains only (?).

And if using (a) approach, how many different genomes to consider from NCBI (only complete or all) when we want to adapt same approach on different reference strains where we have idea that interested strains are closely related (?).

Hope your insights in this will help other SMEG users as well. Thanks in advance.

Cheers, Shriram

aemiol commented 4 years ago

Hi Shriram, Your option (a) is correct. In the paper, we created the database using strains downloaded from NCBI (which also contained our strains of interest). Generally, for species with thousands of strains (e.g E. coli or S. aureus), I recommend building the database using only complete genomes. We estimated replication rate by providing a file listing the strains of interest ( -g argument). I think your option B is also a viable alternative but I personally didn't try it.

To save runtime during the build process, you have the option of creating a database applicable for only reference-based approach by specifying the "-e" flag (and not -a flag). The below command can then be used to estimate replication rates smeg growth_est -m 1 -o -r -s -g -p 8 -n 3

If you created the species_DB using the "-a flag", simply provide ANY of the generated DBs (e.g. T.0.6) for use with the reference approach.

Regarding your last point, provided there are sufficient SNPs (> 100) between your strains of interest, I think you should be fine. Obviously, more SNPs will give more accurate estimates.

ShriramHPatel commented 4 years ago

Thank you very much for your valuable inputs. I confirm that, in reference-based database building -e flag certainly saved much of my run-time.

I have a small query here: In species database building, SNPs uniquely identified for particular strain/ cluster (and the ones used for growth estimation) are only unique relative to the genomes included in the database. Right? Please correct me if I am not clear or mis-interpreting it.

I know that strain coverage is calculated using only available unique SNPs as suggested in the README. But still do you suggest any coverage cutoff for SNPs? Because I have noted high SMEG measurements despite of relatively low coverage (Strain_B) compared to other strains (Strain_C).

Strain SMEG Coverage No of SNPs SMEG range
Strain_A 1.23216372772912 0.80615763546798 886 1.207 - 1.289
Strain_B 2.44157499180529 2.18863361547763 352 1.896 - 2.673
Strain_C 1.73514601963546 14.9533492822967 824 1.642 - 1.841

Thanks again. Shriram

aemiol commented 4 years ago

To your first point, yes. A unique SNP is determined using the provided genomes.

In Supplementary Fig 5 of our paper, we simulated the effect of SNPs on accuracy and found SNPs > 100 is generally ok. However, we emphasized a 5x coverage cutoff for microbes with high within-species genetic diversity. I see strain_B has a wide SMEG range which suggests it may be impacted by coverage. You could consider concatenating both reads1 and 2 to increase coverage

ShriramHPatel commented 4 years ago

Thanks much appreciated your help. That means any SMEG value having less supported coverage value should be consider carefully.

And pardon for my ignorance, But can you clarify more on concatenating. is it simply cat R1+R2 (single end read; which is also default) or pairing R1 - R2.

aemiol commented 4 years ago

Yes, cat R1.fastq R2.fastq > Reads.fastq. Then use that as input for SMEG

ShriramHPatel commented 4 years ago

Many thanks for your valuable support.

ShriramHPatel commented 4 years ago

Heya,

I just have some follow-up questions, so thought to re-open the issue. Hope you don't mind.

Coming back to the in-vitro experiment of S. epidermidis and antibiotics i.e. Fig 3B (right side), plot displays coverage ratios of two strains at different timepoint and it really does make sense. May I know is it SNP coverage generally provided in the SMEG output?

Actually we have a live bacterial count and metagenomic data on the same samples. And we do see increase in SMEG with increase in live counts of bacteria (which is good) but value of SMEG remains relatively compact. What I want to say is SMEG increases from 1.2 to 1.4 when bacterial count gets almost double in population. And in the same scenario we do see marked increase in SNP coverage in doubled bacterial cell population. Do you think in such cases it is good to report SNP coverage along with SMEG?

And regarding number of genomes needed for database building, what if strains of interest has 100 genomes in NCBI (and majority of them are incomplete or MAGs). In that case is it good idea to include all genomes or including 20 complete genomes would be sufficient?

Many thanks,

Shriram

aemiol commented 4 years ago

Yes, in our S. epidermidis analyses, we reported SNP coverage from SMEG output.

Sure, you can report SMEG scores along with SNP coverage depending on the message you want to convey. SMEG increase from 1.2 to 1.4 suggest an increase in the proportion of replicating population from 20% to 40% (assuming your strains lack multifork replication). So the values may not be as compact as you think especially since you see a correlation between SMEG scores and live counts.

Lastly, I wouldn't recommend using incomplete MAGs for database building. Since the number of genomes are less than 500, I recommend DB with as many high quality genomes as possible