GRiD on MAGs.fa - GitHub issues

mldillon-LBL commented 5 years ago

Hello,

I would like to use GRiD on MAGs I've generated, which are in fasta format. I am considering generating fake qual scores to convert to fastq so that I can use this tool (e.g., using seqtk https://www.biostars.org/p/344232/), but I would like to know more about how GRiD uses the qual scores before proceeding. Also, do you have any plans to make this tool compatible with .fa files?

Thanks very much,

Megan

ohlab commented 5 years ago

Hi Megan @mldillon-LBL, yes, it's fine to convert fasta to fastq with fake qual scores, as GRiD relies on the default bowtie2 read filtering step. However, your genomes (MAGs in this case) are expected to be in fasta format. Are you thinking of generating mock reads? What about the reads used in assembling the MAGs?

Thanks

Tunde
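For readers following along, the fasta-to-fastq conversion with placeholder quality scores discussed above can be sketched in a few lines of Python. This is only an illustration, not part of GRiD: the function names and the choice of "I" (Phred 40 in the common Phred+33 encoding) as the fake quality character are assumptions. Note that, per the reply above, this would only apply to sequences supplied as reads; the genomes themselves are expected in fasta format.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


def fasta_to_fastq(lines: Iterable[str], qual_char: str = "I") -> Iterator[str]:
    """Yield one FASTQ record per FASTA record, with a constant fake quality.

    qual_char is an assumption: "I" is Phred 40 in Phred+33 encoding.
    """
    for header, seq in read_fasta(lines):
        yield f"@{header}\n{seq}\n+\n{qual_char * len(seq)}\n"


# Small in-memory demo (hypothetical contig names)
demo = [">contig_1", "ACGT", "TTGA", ">contig_2", "GG"]
print("".join(fasta_to_fastq(demo)), end="")
```

To convert a real file, the same generator can be fed an open file handle instead of the demo list.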

franciscozorrilla commented 5 years ago

Hello, thank you for developing such a nice tool! I have further questions regarding the use of GRiD on MAGs.

I would have intuitively thought that to obtain GRiD scores for a number of reconstructed MAGs originating from a particular sample, one would have to use the grid multiplex option. This is perhaps also the source of @mldillon-LBL's question/confusion. However, after looking at the examples, it looks like the grid multiplex option does not support/allow MAGs as inputs, unless one artificially converts them to .fastq format.

1. Could you confirm that the more appropriate method of obtaining GRiD scores for MAGs is to use the grid single option?
2. Is it possible to provide multiple MAGs to the grid single option in order to generate the heatmap output of grid multiplex?
3. I have ~30 MAGs/sample, with ~140 samples in total. Would you say that the best way to characterize each community is by running grid single -r SAMPLE_X_FOLDER -e .fastq.gz -g MAG_Y_FROM_SAMPLE_X.fasta for each MAG (Y) originating from each sample (X)?
4. Or would it be possible/easier to generate a custom database from my reconstructed MAGs? In this case, would I generate one database per sample/captured set of MAGs, or one large database for ALL the samples/MAGs?

Thanks and best wishes, Francisco
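The per-MAG, per-sample loop described above can be sketched as a small Python helper that builds one grid single command for each (sample, MAG) pair. The -r, -e and -g flags are taken from the command quoted in the question; the function name, the folder layout and the file names are hypothetical, and the commands are only printed here, not executed.

```python
import shlex
from typing import Dict, List


def grid_single_commands(
    mags_by_reads_dir: Dict[str, List[str]], read_ext: str = ".fastq.gz"
) -> List[List[str]]:
    """Build one `grid single` argument list per (sample folder, MAG) pair."""
    commands = []
    for reads_dir, mag_paths in sorted(mags_by_reads_dir.items()):
        for mag in mag_paths:
            commands.append(
                ["grid", "single", "-r", reads_dir, "-e", read_ext, "-g", mag]
            )
    return commands


# Hypothetical layout: each sample folder maps to the MAGs binned from it
layout = {
    "SAMPLE_1_FOLDER": ["MAG_1_FROM_SAMPLE_1.fasta", "MAG_2_FROM_SAMPLE_1.fasta"],
    "SAMPLE_2_FOLDER": ["MAG_1_FROM_SAMPLE_2.fasta"],
}
for cmd in grid_single_commands(layout):
    print(" ".join(shlex.quote(part) for part in cmd))
```

Each argument list could be passed to subprocess.run once the paths are confirmed. With ~30 MAGs across ~140 samples this amounts to roughly 4,200 runs, which is one reason a custom database with grid multiplex may be more practical.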

aemiol commented 5 years ago

Hi @franciscozorrilla, this is a great question, and the solutions you proffer for both "grid single" and "grid multiplex" are fine. The multiplex module would be easier in your case. Unfortunately, heatmaps are only generated with the multiplex module, so you may want to consider your last suggestion of generating custom databases. I'm assuming that a MAG in one sample may also be present in other samples, right? If yes, then you can simply generate a large non-redundant MAG database.

Cheers, Tunde

franciscozorrilla commented 5 years ago

Hi @aemiol, thanks for the response! I am trying out a few different ways of running GRiD. So far I have tried:

1. Creating sample-specific databases using the reconstructed MAGs from each sample, and then running the grid multiplex module.
2. Using the Stool database with the grid multiplex module.

For both of these approaches I used pathoscope and 0.2 coverage. Next, I am trying the following approach:

Creating one large database with all MAGs generated from all samples, then running grid multiplex.

However, I run into this problem when attempting to create the database.

Error: Reference sequence has more than 2^32-1 characters! Please build a large index by passing the --large-index option to bowtie2-build
Error: Encountered internal Bowtie 2 exception (#1)

I tried passing the --large-index flag to update_database as suggested by the error message, but it seems like the flag is not recognized and the job errors out. I am currently trying again, but this time I modified line 155 of the update_database script to:

bowtie2-build $NAME $DBR/BOWTIE_$NAME --large-index --threads 32

I am just wondering if you foresee any problems with this modification? Will the database still be usable by GRiD with the --large-index option?
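As an illustration of the limit named in the error message, the combined length of the sequences in a database fasta can be checked against 2^32-1 before building. This is a minimal sketch and not part of GRiD or bowtie2: the function names are hypothetical, and the threshold is taken directly from the error text above.

```python
from typing import Iterable

# Limit quoted in the bowtie2-build error message above
SMALL_INDEX_LIMIT = 2**32 - 1


def total_reference_length(lines: Iterable[str]) -> int:
    """Sum the sequence characters in FASTA lines, ignoring header lines."""
    return sum(len(line.strip()) for line in lines if not line.startswith(">"))


def needs_large_index(total_length: int) -> bool:
    """True when the reference exceeds the small-index limit."""
    return total_length > SMALL_INDEX_LIMIT


# Small in-memory demo (hypothetical records)
demo = [">MAG_1_contig_1", "ACGTACGT", ">MAG_2_contig_1", "GGCC"]
length = total_reference_length(demo)
print(length, needs_large_index(length))
```

For a real database, the same function can be fed an open file handle for the concatenated fasta.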

Regarding your suggestion:

"I'm assuming that a MAG in one sample may also be present in other samples, right? If yes, then you can simply generate a large non-redundant MAG database."

In theory yes, some MAGs will correspond to species that are common to multiple samples, but I am trying to capture as much sample-specific strain-level variation as possible, so perhaps I am better off using approach number 1 that I listed above?

After looking at the results obtained using approach number 1, it appears that the coverage for some of my MAGs is below 0.2x, since in general I have more MAGs per sample than result entries in each of my text output files. Would you suggest lowering the coverage cutoff? Or will GRiD yield very unreliable results if I lower the coverage threshold?

Thanks and best wishes, Francisco
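To get a rough feel for which MAGs might fall under the 0.2 cutoff mentioned above, coverage can be approximated as mapped bases divided by genome length. This is a back-of-the-envelope sketch only: how GRiD itself computes coverage and applies the cutoff may differ, and the function names, the example numbers and the use of >= for the comparison are all assumptions.

```python
def estimated_coverage(num_mapped_reads: int, read_length: int, genome_length: int) -> float:
    """Approximate mean coverage as (reads x read length) / genome length."""
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    return num_mapped_reads * read_length / genome_length


def passes_cutoff(coverage: float, cutoff: float = 0.2) -> bool:
    """Assumed rule: keep a genome when its coverage is at least the cutoff."""
    return coverage >= cutoff


# Hypothetical example: 3,000 reads of 150 bp mapped to a 3 Mbp MAG
cov = estimated_coverage(3_000, 150, 3_000_000)
print(f"{cov:.2f}", passes_cutoff(cov))
```

In this made-up case the estimate is 0.15x, so the MAG would be dropped under the assumed rule, which matches the pattern of fewer result entries than MAGs.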

nigiord commented 5 years ago

@franciscozorrilla I think what you are encountering is a bowtie2 bug related to the -q option used by the GRiD script: https://github.com/BenLangmead/bowtie2/issues/245#issuecomment-485769090

Someone has been assigned to fix it. In the meantime, a workaround is indeed to edit line 155 and either remove the -q option or force the use of a large index with --large-index (either should work).

"I am just wondering if you foresee any problems with this modification? Will the database still be usable by GRiD with the --large-index option?"

I do not see any problem arising from such a modification. The index size is internal to bowtie2 and is transparent to the SAM files it generates, and hence to GRiD itself.

Cheers, Nils

ohlab / GRiD

GRiD on MAGs.fa #5