milaboratory / mixcr

MiXCR is an ultimate software platform for analysis of Next-Generation Sequencing (NGS) data for immune profiling.
https://mixcr.com

MIXCR-test D-gene not recovered #207

Closed darth-donut closed 7 years ago

darth-donut commented 7 years ago

The MiXCR-test documentation gives an example of exportSynth in which the V, D and J segments are exported. However, even when following the example verbatim, I could not reproduce the documented output: the D gene (and even the C gene) is not shown, see below.

Additionally, the igBLAST (or rather, MIGMAP) output shows that over 94% of my generated sequences are non-coding (i.e. they have frameshifts or stop codons). Is there an argument to adjust this during sequence generation?

Output

[Screenshot of the exported output, taken 2017-02-08 at 11:34:44 am]

Commands

Ran verbatim from the example

mixcr-test generate -n 10000 -c 1000 -l IGH data.synth
mixcr-test export data.synth data.fastq.gz
mixcr-test exportSynth data.synth data.descr

dbolotin commented 7 years ago

Hi,

First of all, I want to warn you that the MiXCR-Test pipeline was not developed for use outside the context of the original paper, and it can't in any way be considered a general tool for generating synthetic RepSeq data.

We made binaries and source code available solely to make it possible for anybody to validate our benchmarking procedures.

The documentation was written some time after all the benchmarks were run, based on code we had already changed for other optimisation purposes, so we missed the fact that new fields had been introduced into the output. The data is still there in the data.synth file, even with the 1.1 binary, and it is used to calculate the fraction of misinterpreted D genes in the mixcr-align, igblast, etc. actions; it's just not exported to the txt file. Sorry for the out-of-sync docs!

Here is the latest version of MiXCR-Test I can get from our archive. It outputs D genes as described in the docs, but I can't guarantee that it works the same as the binary published with the paper (though, most probably, it does): http://files.milaboratory.com/mixcr/paper/mixcr-test-1.2-SNAPSHOT.jar

Concerning the last part of your question: unfortunately, there is no filtering option that keeps only in-frame sequences without stop codons.
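If filtering has to happen after generation, a small script could do it. The following is only a minimal sketch and not part of MiXCR-Test: the names `is_productive` and `keep_productive` are hypothetical, and it assumes each input is a plain nucleotide string whose reading frame starts at a known offset (`frame`), which is a simplification of how productivity is normally judged for a rearranged V(D)J sequence.

```python
# Hypothetical helpers, not part of MiXCR-Test or MiXCR.
# Assumption: each sequence is a plain nucleotide string and its reading
# frame starts at the offset given by `frame`.

STOP_CODONS = {"TAA", "TAG", "TGA"}


def is_productive(seq: str, frame: int = 0) -> bool:
    """Return True if `seq`, read from offset `frame`, has a length that is
    a multiple of three and contains no stop codon."""
    coding = seq.upper()[frame:]
    # An empty or out-of-frame sequence is treated as non-coding.
    if len(coding) == 0 or len(coding) % 3 != 0:
        return False
    codons = (coding[i:i + 3] for i in range(0, len(coding), 3))
    return not any(codon in STOP_CODONS for codon in codons)


def keep_productive(seqs, frame: int = 0):
    """Keep only the sequences that pass `is_productive`."""
    return [s for s in seqs if is_productive(s, frame)]
```

Calling `keep_productive(["ATGGCC", "ATGTGA", "ATGG"])` keeps only the first sequence: the second ends in a stop codon and the third is out of frame. Note that filtering after generation shrinks the data set, so with a high non-coding fraction many more sequences would probably have to be generated to reach a target size.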

Have a look at http://yana-safonova.github.io/ig_simulator/ by @yana-safonova . To date, it is the only RepSeq simulation tool that has been published in a dedicated paper. The paper suggests that it aims to become a widely adopted tool for this purpose, which implies more or less long-term support. To me it looks like the best option for the task (after reinventing the wheel, of course :wink:).

Best, Dmitry.

darth-donut commented 7 years ago

Hi Dmitry, Cheers! I was assuming it would be safe to use the exportSynth routine from 1.2-SNAPSHOT on a 1.1 data.synth file, but I got a java.lang.RuntimeException: wrong reference id. Is the binary file generated by 1.1 not compatible with 1.2-SNAPSHOT? It would be great if I could salvage the D gene from the 1.1 binary file with some sort of parser.

In fact, I did have a look at ig_simulator before stumbling on your mixcr-test simulator. Unfortunately, the same problem, too large a fraction of non-coding clonotypes (>90%), is also prominent in ig_simulator. I guess it's difficult to introduce indels without messing up the productivity. Anyhow, thanks for the updated tool. Harry

dbolotin commented 7 years ago

Hi Harry,

Data from 1.1 is not compatible with 1.2-SNAPSHOT, because the binary format produced by the underlying serialization library has changed. Unfortunately, there is no way to convert data from one format to the other. Of course, you can regenerate the data from scratch using 1.2.

Dmitry.

darth-donut commented 7 years ago

Noted, cheers!