merenlab / anvio

An analysis and visualization platform for 'omics data
http://merenlab.org/software/anvio
GNU General Public License v3.0
426 stars 145 forks source link

obtaining all reads mapping to a given bin #173

Closed rikander closed 9 years ago

rikander commented 9 years ago

It would be lovely to be able to pull out all of the reads that mapped to a given bin so that you can do things further downstream like reassembly, or combining those reads with SAG reads, or whatever.

In general, having command-line access to the database containing the bins, the contigs, the splits, the reads, the average coverage per bin, average coverage per split, the normalization factors, etc. would make it much easier for us to access the data and then write our own scripts for whatever specific analyses we want to do. (This is presuming you have a SQL database containing this already, but maybe not-- I have no idea how hard this would be to implement.)

meren commented 9 years ago

Yes, the ability to recover all short reads that map to a particular bin will be an important feature going forward. I will do it, which will not be very easy :( But I will try.

Bin related info will always be reported best in the summary output. But the user has command line access to everything else. Say you want the mean coverage of each split across all samples? Do this:

anvi-get-db-table-as-matrix PROFILE.db -t mean_coverage_splits

Do you need tetranucleotide frequency matrix for some reason?

anvi-get-db-table-as-matrix ANNOTATION.db -t kmer_splits

Do you need "name, length, and the GC content of every contig with a GC content of less than 30%"?

sqlite3 ANNOTATION.db 'select contig,length,gc_content from contigs_basic_info where gc_content < 0.3' | sed 's/|/       /g'

You have access to everything.

Although it will take a bit of time to learn about the architecture of the platform to exploit it fully :(

I am hoping the initial paper will help with that to an extent, but I can see I am going to have try much harder than that to help people understand all the details better.

Thank you for the suggestions, Rika. I will start working on them as soon as I am back to a normal working schedule.

rikander commented 9 years ago

Actually, coverage of each split across all samples was one of the things I was hoping for. It's good to see some of these commands are already written (a user manual will help a lot!), and as you say, maybe a general overview of the structure of the annotation database will help us access the rest.

Thanks!

meren commented 9 years ago

If you run this, it will give you a list of tables you have in a given database:

anvi-get-db-table-as-matrix DATABASE --list

Then, you can export any of those tables as a matrix:

anvi-get-db-table-as-matrix DATABASE --table TABLE_NAME

And you can even specify which fields you are interested in:

$ anvi-get-db-table-as-matrix -h
usage: anvi-get-db-table-as-matrix [-h] [-t TABLE] [-l] [-f TABLE] [-o OUTPUT]
                                   DB

A script to export anvio tables as TAB-delimited matrices.

positional arguments:
  DB                    Database to read.

optional arguments:
  -h, --help            show this help message and exit
  -t TABLE, --table TABLE
                        Table name to export.
  -l, --list            Gives a list of tables in a database and quits. If a
                        table is already declared this time it lists all the
                        fields in a given table, in case you would to export
                        only a specific list of fields from the table using
                        --fields parameter.
  -f TABLE, --fields TABLE
                        Fields to report. USe --list parameter with a table
                        name to see available fields You can list fields using
                        this notation: --fields "field_1, field_2, ...
                        field_N"
  -o OUTPUT, --output OUTPUT
                        Output file path.
meren commented 9 years ago

OK. We have the first version of the client, anvi-get-short-reads-from-bam, to get short reads back from BAM files. It is pretty talented, but it needs some testing.

$ anvi-get-short-reads-from-bam --help
usage: anvi-get-short-reads-from-bam [-h] -p PROFILE_DB -a ANNOTATION_DB -c
                                     COLLECTION-ID [-b BIN_ID] [-B BINS FILE]
                                     [-o OUTPUT_FILE_NAME] [--debug]
                                     BAM FILE[S] [BAM FILE[S] ...]

Get short reads back from a BAM file.

positional arguments:
  BAM FILE[S]           BAM file(s) to access to recover short reads

optional arguments:
  -h, --help            show this help message and exit
  -p PROFILE_DB, --profile-db PROFILE_DB
                        Profile database.
  -a ANNOTATION_DB, --annotation-db ANNOTATION_DB
                        Annotation database.
  -c COLLECTION-ID, --collection-id COLLECTION-ID
                        Collection id to look for bins.
  -b BIN_ID, --bin-id BIN_ID
                        A bin to analayze (you either use this, or use a file
                        to list each bin id you prefer to analyze).
  -B BINS FILE, --bin-ids-file BINS FILE
                        Text file for bins (each line should be a bin id).
  -o OUTPUT_FILE_NAME, --output-file-name OUTPUT_FILE_NAME
                        Output directory
  --debug               Extra outputs for debugging. I would not use it if I
                        were you.

And here is an example from the mini_test:

$ anvi-get-short-reads-from-bam -p test-output/204-MERGED/PROFILE.db \
                                                    -a test-output/ANNOTATION.db \
                                                    -c CONCOCT \
                                                    -b Bin_2 \
                                                    -o short_reads_for_Bin_2.fasta \
                                                    test-output/*bam

Input BAM file(s) ............................: 204-6M.bam, 204-7M.bam, 204-9M.bam
Collection ID ................................: CONCOCT
Bin(s) .......................................: Bin_2
Number of splits .............................: 17
Num reads stored .............................: 23,338
FASTA output .................................: short_reads_for_Bin_2

Please let me know if you end up testing it.

At this point we are only working with individual reads and stores a FASTA file, but I believe it is going to be critical to add support for paired-end reads and stores results in FASTQ files.

ghost commented 7 years ago

Hi Meren,

Does the current version of Anvi'o support storing as FASTQ file?

Thank you, Nancy

meren commented 7 years ago

No, unfortunately. It is rather easy to convert FASTA files to FASTQ files, and we never run into a situation we needed FASTQ files.

Why do you need it?

On Jun 21, 2017 7:23 PM, "Fungal" notifications@github.com wrote:

Hi Meren,

Does the current version of Anvi'o support storing as FASTQ file?

Thank you, Nancy

— You are receiving this because you modified the open/close state. Reply to this email directly, view it on GitHub https://github.com/merenlab/anvio/issues/173#issuecomment-310240932, or mute the thread https://github.com/notifications/unsubscribe-auth/AAMCu88FnMyljLkSzcqhtEWSrFWPpwFRks5sGbQBgaJpZM4FcFnV .

ghost commented 7 years ago

Hi Meren,

Thank you for your reply.

1) I was thinking to export short reads and then do re-assembly of some bins through metaspades. I have several different profiles of the same samples due to using spades and megahit. I'm not sure if this is ok, but I wanted to re-assemble similar bins (called by taxonomy) from these different profiles.

2) I'm going to refine some SAGs individually on Anvi'o and then export short reads to assemble on spades (also will do co-assembly on spades later).

What do you think? I think the main issue is that spades uses only fastq. I got this error when I tried today for my metagenome samples.

Thanks! Nancy

Sent from my iPhone

On Jun 21, 2017, at 5:44 PM, A. Murat Eren notifications@github.com wrote:

No, unfortunately. It is rather easy to convert FASTA files to FASTQ files, and we never run into a situation we needed FASTQ files.

Why do you need it?

On Jun 21, 2017 7:23 PM, "Fungal" notifications@github.com wrote:

Hi Meren,

Does the current version of Anvi'o support storing as FASTQ file?

Thank you, Nancy

― You are receiving this because you modified the open/close state. Reply to this email directly, view it on GitHub https://github.com/merenlab/anvio/issues/173#issuecomment-310240932, or mute the thread https://github.com/notifications/unsubscribe-auth/AAMCu88FnMyljLkSzcqhtEWSrFWPpwFRks5sGbQBgaJpZM4FcFnV .

― You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.