milaboratory / mixcr

MiXCR is an ultimate software platform for analysis of Next-Generation Sequencing (NGS) data for immune profiling.
https://mixcr.com

Add in parser for EOL characters #293

Closed xww52526 closed 6 years ago

xww52526 commented 6 years ago

Hi,

Thanks for this great software.

I have a feature request: could you consider adding handling of end-of-line (EOL) characters to the file parser? My project requires splitting the original FASTQ files into several smaller ones with different barcode sequences. If this is done on a Windows machine, the EOL characters change from the Unix \n to \r\n, which raises the error "unknown letter 13" when the files are fed to mixcr for alignment (13 is the ASCII code for \r). Is it possible to add EOL handling, so that intermediate FASTQ files can always be read by mixcr regardless of the OS on which they were generated?

Thanks very much!

dbolotin commented 6 years ago

Hi,

Unfortunately, some time ago we decided against this feature and chose to support \n line breaks exclusively. This is partly for performance reasons: internally, MiXCR uses a two-stage FASTQ parser that spreads the actual parsing load across multiple threads. This allows throughput of up to several gigabytes per second, which is important for achieving the best possible overall performance on modern multiprocessor nodes (with > 50 cores). While it is possible to implement CRLF support, it would unnecessarily complicate the code. The other reason is to support standardisation of the FASTQ format: the accumulation of such small variations complicates the development of new software.

Fortunately, it is very simple to convert line breaks, or to write the processing script so that it produces consistent \n line breaks in the first place (which I encourage you to do).
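As an illustration of such a conversion, here is a minimal Python sketch. It is not part of MiXCR; the function name `normalize_line_breaks` and the file names in the usage note are hypothetical. It assumes an uncompressed FASTQ file and rewrites every \r\n as \n while streaming line by line, so the whole file is never held in memory.

```python
def normalize_line_breaks(src_path, dst_path):
    """Copy src_path to dst_path, turning every CRLF (\\r\\n) line break into LF (\\n).

    Hypothetical helper for illustration; assumes an uncompressed text file.
    """
    # Binary mode keeps Python from translating line breaks on its own,
    # and iterating a binary file splits on b"\n" only.
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for line in src:
            if line.endswith(b"\r\n"):
                # Drop the carriage return, keep the line feed.
                line = line[:-2] + b"\n"
            dst.write(line)
```

For example, `normalize_line_breaks("sample_windows.fastq", "sample_unix.fastq")` would write a copy with \n line breaks only. Lines that already end in \n are copied unchanged, so running it on a Unix-style file is harmless. A \r that is not directly followed by \n is left untouched.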

MiXCR 2.2 will output a more informative error message for this case.

All the best, Dmitry.