milaboratory / mixcr

MiXCR is an ultimate software platform for analysis of Next-Generation Sequencing (NGS) data for immune profiling.
https://mixcr.com

export internal DB sequence? #126

Closed bbimber closed 7 years ago

bbimber commented 8 years ago

Hello,

I am interested in comparing the internal human DB used in MiXCR to that of another species. I did not see anything in the docs describing how to export the internal DB to any sort of text format. Is this possible, or are the sequences available somewhere?

Thanks.

dbolotin commented 8 years ago

Hi,

please see here and here.

The repseqio/library repo contains our current built-in library (this commit).

I just released repseqio 1.1.0, which is bundled with the very same library (the library repo is linked as a git submodule, and every Maven build automatically compiles all JSON files from it).

Just install repseqio, and type:

repseqio fasta -s hs -g VRegion default VGenes.fasta

Here is the help for the command:

Export sequences of genes to fasta file.
Usage: fasta [options] input_library.json|default [output.fasta]
  Options:
    -c, --chain
       Chain pattern, regexp string, all genes with matching chain record will
       be exported.
    -f, --force
       Force overwrite of output file(s).
  * -g, --gene-feature
       Gene feature to export (e.g. VRegion, JRegion, VTranscript, etc...)
    -h, --help
       Displays help for this command.
       Default: false
    -n, --name
       Gene name pattern, regexp string, all genes with matching gene name will
       be exported.
    -s, --species
       Species name, used in the same way as --taxon-id.
    -t, --taxon-id
       Taxon id (filter multi-library file to leave single library for specified
       taxon id)
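
As an illustration of combining the options listed above, here is a minimal sketch that exports one FASTA file per gene feature from the bundled "default" library. The function name export_features and the output file naming are hypothetical, and the chain value (e.g. TRB) is an assumption about how chains are named in the library; only the flags shown in the help text are used.

```shell
# Hypothetical helper (not part of repseqio): export one FASTA file per
# gene feature from the bundled "default" library.
# Arguments: species, chain pattern, then one or more gene features.
export_features() {
    species="$1"   # species name, e.g. hs
    chain="$2"     # chain regexp, e.g. TRB (assumed to match the library's chain names)
    shift 2
    for feature in "$@"; do
        # -f overwrites an existing output file; stop at the first failure.
        repseqio fasta -f -s "$species" -c "$chain" -g "$feature" \
            default "${species}_${chain}_${feature}.fasta" || return 1
    done
}
```

For example, `export_features hs TRB VRegion JRegion` would try to write hs_TRB_VRegion.fasta and hs_TRB_JRegion.fasta. The chain pattern is reused in the file name, so this sketch only suits simple patterns without special characters.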

You can also clone library repo and export sequences directly from json files there:

git clone https://github.com/repseqio/library.git
cd library/human/
repseqio fasta -g VGene TRB.json TRB.fasta

Among other things, this will automatically download the required sequences from GenBank (to the ~/.repseqio folder).
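
To sanity-check an exported file such as TRB.fasta before comparing it with another species, something like the following could be used. This is a rough sketch with hypothetical function names; it relies only on the general FASTA convention that each record starts with a ">" header line, and makes no assumption about what repseqio puts in those headers.

```shell
# Count FASTA records by counting header lines (lines starting with ">").
count_fasta_records() {
    grep -c '^>' "$1"
}

# Print each header line with the leading ">" removed.
list_fasta_headers() {
    sed -n 's/^>//p' "$1"
}
```

For example, `count_fasta_records TRB.fasta` prints the number of exported sequences, and `list_fasta_headers TRB.fasta | head` shows the first few record names.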

If you notice errors in the positions of anchor points, missing segments, or any other issues in the library, it would be great if you could open an issue or send a pull request to the repseqio/library repo.

P.S. The current development version of MiXCR already supports repseq.io-style libraries:

mixcr align --library my_lib.json ...

If you wish, I can build a jar for you and send it by email. Alternatively, you can build it yourself (create a folder and paste the following into a terminal; it should work if you have git and Maven installed):

git clone https://github.com/repseqio/repseqio.git
cd repseqio
git submodule init
git submodule update
cd milib
mvn clean install -DskipTests
cd ..
mvn clean install -DskipTests
cd ..
git clone https://github.com/milaboratory/mixcr.git
cd mixcr
git checkout feature/repseq.io
mvn clean install -DskipTests
./mixcr -v

Now you can use the mixcr script from the cloned mixcr folder.

bbimber commented 8 years ago

great, thanks.

bbimber commented 7 years ago

Hello,

I'm hoping you could give advice on updating the TCR DB for another species and on defining the coordinates needed for your JSON format. I exported your internal DB and was planning to BLAST it against the appropriate chromosome(s) and annotate using the hits. However, I can export all sorts of permutations of the V/D/J regions. Is this the approach you would recommend, and which specific MiXCR segments would you recommend using for BLASTing?

Thanks for any help.