tfq support for distmap upload

robmaz commented 6 years ago

Sooner or later I would also like to move from Ram's format, where uninformative fastq lines are removed before putting the records on one line, to my more straightforward "tabbed fastq" format, where nothing is removed and records are jammed together like this (for paired-end):

samtools fastq "$b" | paste - - - - - - - - | hadoop fs -put - "${b%.bam}.tfq"

(or ... | mc4 | hadoop fs ... ....tfq.mc4 with mc4 compression), which is much less hassle to handle with command-line tools.

While readtools support for this is not so urgent, maybe it would not be so difficult to already include an option to produce this format?
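For reference, a minimal sketch of the round trip described above, assuming the input is interleaved paired-end FASTQ (8 lines per pair) with no tabs in any line. The function names `fastq_to_tfq` and `tfq_to_fastq` are made up for illustration and are not part of ReadTools or DistMap.

```shell
# Pack interleaved paired-end FASTQ (8 lines per pair) into one
# tab-separated line per pair. Assumes no line already contains a tab.
fastq_to_tfq() {
    paste - - - - - - - -
}

# Reverse step: break each tabbed record back into plain FASTQ lines.
tfq_to_fastq() {
    tr '\t' '\n'
}
```

With these, the upload command would read `samtools fastq "$b" | fastq_to_tfq | hadoop fs -put - "${b%.bam}.tfq"`.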

magicDGS commented 6 years ago

@robmaz - doing that sounds like you will keep a lot of unneeded information while uploading (such as duplicated read names, and quality headers that contain nothing).

Are you sure that this tabbed format is desirable? It might increase the size of the files, and ReadTools has already implemented a parser for the current distmap format. I think that the proper way to go is to have BAM files directly in HDFS: the input format can be handled by Hadoop-BAM, and there are even splitting indexes for BAM files for fast splitting of records in the same partition...

robmaz commented 6 years ago

I think with any compressor these unneeded lines will basically disappear anyway, and I think it is desirable to easily generate and read the input stored in hdfs with alternative tools.

While I agree that going for BAMs might be preferable in the long run, let me highlight the two major current issues:

Hadoop-BAM has no release for Hadoop 2.7.x, only 2.2. It requires building, probably messy like everything Hadoop-related, and may not even work.
Since bwa mem does not support BAM input, we'd need a custom InputFormat class, which we currently don't have. This also needs to take care of the header issue (I think you said that headers may be required in the near future? It is curious that the BAM-heavy GATK people who also use bwa mem as their default mapper have not resolved this issue, so maybe something is coming from there?)

The tabbed fastq thing in contrast works already, and to get mc4 support in I just need to restart the cluster at some point.

robmaz commented 6 years ago

In any case, this is a minor feature and certainly not urgent.

magicDGS commented 6 years ago

I see your point, and it might be good to have the tabbed input sooner rather than later if you want to move to an easier-to-manage format. Regarding the issues with BAM:

The GATK team uses Hadoop-BAM with Hadoop >= 2.7, and they just pull the library from Maven Central. So there is probably no issue in using it with the current version. In addition, there is now a discussion, in which I am involved, about the governance of Hadoop-BAM and how to proceed with its development. Thus, it looks promising for the development of Distmap.
Although bwa mem does not support BAM input, the InputFormat class can still be the SAM/BAM abstraction from Hadoop-BAM. As is done now with the custom distmap format, the map task in the MapReduce pipeline converts the input into a format understandable by the mapper. Thus, the headerless BAM part passed from Hadoop-BAM can be converted to FASTQ (using for example Picard or samtools) and then passed to the mapper; afterwards, Picard can merge the unaligned input with the aligned output, and thus keep all the information.

The GATK team does not care about the bwa mem implementation; in their pre-GATK4 pipeline they were converting to FASTQ before mapping and then merging with Picard; after GATK4 they have a Spark version from unmapped BAM to mapped reads with a JNI for bwa.

In addition, GATK is using the Hadoop-BAM library for almost every HDFS/Spark related piece of code, so that means that it can work properly here too.

The real way to move Distmap forward, so that it accepts any kind of input in a consistent way, is to implement the framework in ReadTools. Or maybe better, implement a Sparkified version of Distmap in ReadTools. Anyway, that is much more difficult than just changing the input format. If you really think that it will be useful, I can spend some of my free development time implementing a hidden, advanced argument to output this new format (--new-format).

magicDGS commented 6 years ago

@robmaz - one question about this: what will happen if the FASTQ that is parsed by paste already contains tabs? This will fail!

I am looking into this and also into the 4mc codec...

robmaz commented 6 years ago

I don't think the sequence or the quality string can contain tabs, but maybe the descriptions in Lines 1 and 3? Fortunately none of my fastqs had them. Anyway, the obvious solution would be to mask them before pasting and unmask them after splitting, and I guess the current format must solve the same problem.

magicDGS commented 6 years ago

Usually, it is not recommended to have them in the first line (and ReadTools is passing it to a comment tag), but in the quality header I have found a lot of weird stuff in the past. The current format does not need to mask any of that.

Regarding the masking, do you mean percent-encoding the tabs in the quality header? That would be a proper way of handling it, but then it requires post-processing anyway (and thus the major improvement of being easy to parse does not hold...)
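A rough sketch of what such masking could look like in the shell, assuming standard percent-encoding (`%09` for a tab, `%25` for a literal percent sign) applied only to the two header lines of each 4-line record, so that quality strings containing `%` are left untouched. The function names `mask_headers` and `unmask_headers` are hypothetical, and this is not how ReadTools or the current distmap format handle it.

```shell
# Percent-encode tabs (and literal '%') in FASTQ header lines only,
# i.e. lines 1 and 3 of every 4-line record.
mask_headers() {
    awk 'NR % 4 == 1 || NR % 4 == 3 { gsub(/%/, "%25"); gsub(/\t/, "%09") } { print }'
}

# Inverse of mask_headers: decode %09 first, then %25, so that a
# header that originally contained the text "%09" is restored intact.
unmask_headers() {
    awk 'NR % 4 == 1 || NR % 4 == 3 { gsub(/%09/, "\t"); gsub(/%25/, "%") } { print }'
}
```

The pipeline would then be `... | mask_headers | paste - - - - - - - - | ...` on upload and `... | tr '\t' '\n' | unmask_headers` when reading back, which illustrates the extra post-processing step mentioned above.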

magicDGS commented 6 years ago

@robmaz - do you think that it is really important to keep the quality header in this case? What about using @{readName}\t{sequence}\t+\t{quality} for single-end and @{readName1}\t{sequence1}\t+\t{quality1}\t@{readName2}\t{sequence2}\t+\t{quality2} for paired-end, ignoring the quality header?

This still has the same advantages as using paste, but without including unneeded information in the quality header. With ReadTools, it is ensured that the read name will contain the information required for processing (Illumina-encoded name, using # to separate the barcode and / to separate the pair-end information).

If this is good enough for you, I can refactor the code in the distmap output to detect the .tfq extension and write that format; otherwise, it will output the current one. Another option is an extension-independent format and an advanced parameter. Let me know what is better for your case...

robmaz commented 6 years ago

Not at all, since this is only used to construct input files for the mappers. But what I did instead, after you warned me about the tabs, was to change all tabs to spaces before paste-ing it, i.e.,

... | tr '\t' ' ' | paste - - - - - - - - ...

It is not very difficult to produce the format you suggest in a shell one-liner, but I think this is even simpler.

Cheers Rupert

magicDGS commented 6 years ago

But ReadTools does not use the shell; it is pure Java plus my framework. What ReadTools is doing internally for every SAM/FASTQ file is the following:

Convert to a consistent object: read name without spaces and with barcode/pair-end information stripped, BC tag for the barcode extracted from the read name, CO tag for quality header comments (only FASTQ).
For the distmap format, it just grabs the needed information: read name, sequence and quality.
For the FASTQ output, it converts the header back to the Illumina format: appends BC and pair-end information to the read name, and regenerates the quality header from the CO tag.
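To make the first step concrete, here is a small shell sketch of splitting an Illumina-encoded header of the form `@name#BARCODE/1` into read name, barcode and pair number. It only mimics the name handling as described in the steps above and is not ReadTools' actual Java code; the function name `split_illumina_name` and the tab-separated output layout are assumptions made for illustration.

```shell
# Read FASTQ header lines on stdin and print "name<TAB>barcode<TAB>pair".
# Anything after the first whitespace is dropped; missing parts are empty.
split_illumina_name() {
    awk '{
        name = $1
        sub(/^@/, "", name)                 # drop the leading "@"
        pair = ""; barcode = ""
        if (match(name, /\/[12]$/)) {       # trailing /1 or /2 marks the pair end
            pair = substr(name, RSTART + 1)
            name = substr(name, 1, RSTART - 1)
        }
        i = index(name, "#")                # "#" separates the barcode
        if (i > 0) {
            barcode = substr(name, i + 1)
            name = substr(name, 1, i - 1)
        }
        print name "\t" barcode "\t" pair
    }'
}
```

For example, `printf '@read1#ACGT/1 some comment\n' | split_illumina_name` separates the header into the three fields, and a header without barcode or pair suffix yields empty second and third fields.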

I think that, because the main idea of using paste is to speed things up, it is better to generate a "minimal-tfq" directly from the internal object, without the quality header, which isn't used by any mapper (as far as I know). This will also help to reduce the file size if not compressed (or even if compressed). I can imagine huge files after converting a BAM file whose CO tags are large but unrelated to the quality header...

ReadTools already moves the quality header to the comment tag, and thus it is easier not to include the quality header.

robmaz commented 6 years ago

Sure, do it like this then. I just like the idea of having a shell "reference implementation", but I think I can replicate the format with a tiny sed script.

Cheers Rupert
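A possible shell "reference implementation" of the minimal format for paired-end input could look like the sketch below. It uses awk rather than sed, because addressing every fourth line in sed relies on a GNU extension; the function name `minimal_tfq_paired` is made up, and the sketch assumes interleaved FASTQ on stdin. Tabs in the remaining lines are turned into spaces, as in the tr | paste variant.

```shell
# Emit @{readName1}\t{sequence1}\t+\t{quality1}\t@{readName2}\t{sequence2}\t+\t{quality2}
# for interleaved paired-end FASTQ read from stdin.
minimal_tfq_paired() {
    awk '
        NR % 4 == 3 { print "+"; next }   # replace the quality header by a bare "+"
        { gsub(/\t/, " "); print }        # neutralize stray tabs in the other lines
    ' | paste - - - - - - - -
}
```

For single-end data the same awk step followed by `paste - - - -` would give the single-end layout.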

magicDGS commented 6 years ago

I believe that most of the FASTQ files handled here do not contain any quality header; and anyway, the reference implementation can still be the tr | paste pipe, because it shouldn't change the result (and if it does, it means that we need to include it in ReadTools).

magicDGS / ReadTools

tfq support for distmap upload #404