Open robmaz opened 6 years ago
@robmaz - doing that sounds like you will keep a lot of unneeded information while uploading (such as duplicated read names and quality headers that contain nothing).
Are you sure that this tabbed format is desirable? It might increase the size of the files, and ReadTools already implements a parser for the current distmap format. I think that the proper way to go is to have BAM files directly in HDFS: the input format can be handled by Hadoop-BAM, and there are even splitting indexes for BAM files for fast splitting of records in the same partition...
I think with any compressor these unneeded lines will basically disappear anyway, and I think it is desirable to easily generate and read the input stored in HDFS with alternative tools.
While I agree that going for BAMs might be preferable in the long run, let me highlight the two major current issues:
The tabbed fastq thing in contrast works already, and to get mc4 support in I just need to restart the cluster at some point.
In any case, this is a minor feature and certainly not urgent.
I see your point, and it might be good to have the tabbed input sooner rather than later if you want to move to an easier-to-manage format. Regarding the issues with BAM:
- Although bwa mem does not support BAM input, the InputFormat class can still be the SAM/BAM abstraction from Hadoop-BAM. As it is done now with the custom distmap format, the map task in the MapReduce pipeline converts the input into a format that the mapper understands. Thus, the headerless BAM part passed from Hadoop-BAM can be converted to FASTQ (using, for example, Picard or samtools) and then passed to the mapper; afterwards, Picard can merge the unaligned input with the aligned output, and thus keep all the information.
- The GATK team does not care about the bwa mem implementation; in their pre-GATK4 pipeline they were converting to FASTQ before mapping and then merging with Picard; since GATK4 they have a Spark version that goes from unmapped BAM to mapped reads with a JNI for bwa.
In addition, GATK is using the Hadoop-BAM library for almost every HDFS/Spark-related piece of code, so it can work properly here too.
The real way to move Distmap forward, so that it accepts any kind of input in a consistent way, is to implement the framework in ReadTools. Or maybe better, implement a Sparkified version of Distmap in ReadTools. Anyway, that is much more difficult than just changing the input format. If you really think that it will be useful, I can spend some of my free development time implementing a hidden, advanced argument to output this new format (--new-format).
@robmaz - one question about this: what will happen if the FASTQ that is parsed by paste already contains tabs? This will fail!
I am looking into this and also into the 4mc codec...
I don't think the sequence or the quality string can contain tabs, but maybe the descriptions in lines 1 and 3? Fortunately none of my fastqs had them. Anyway, the obvious solution would be to mask them before pasting and unmask them after breaking, and I guess the current format must solve the same problem.
Usually it is not recommended to have them in the first line (and ReadTools passes it to a comment tag), but I have found a lot of weird stuff in the quality header in the past. The current format does not need to mask any of that.
Regarding the masking, do you mean percent-encoding the tabs in the quality header? That would be a proper way of handling it, but then it requires post-processing anyway (and thus the major improvement of being easy to parse does not hold...)
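The percent-encoding idea discussed above could look roughly like the following. This is only a sketch: the helper names `mask_tabs` and `unmask_tabs` are hypothetical and not part of ReadTools or Distmap, and it assumes plain line-oriented text on stdin.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of ReadTools or Distmap.
TAB=$(printf '\t')

# Escape literal '%' first so the encoding stays reversible, then encode tabs.
mask_tabs() {
    sed -e 's/%/%25/g' -e "s/${TAB}/%09/g"
}

# Reverse order: restore tabs first, then restore literal '%'.
unmask_tabs() {
    sed -e "s/%09/${TAB}/g" -e 's/%25/%/g'
}
```

Masking would run before paste, and unmasking only after a record has been split back into lines, which is the post-processing cost mentioned above.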
@robmaz - do you think that it is really important to keep the quality header in this case? What about using @{readName}\t{sequence}\t+\t{quality} for single end and @{readName1}\t{sequence1}\t+\t{quality1}\t@{readName2}\t{sequence2}\t+\t{quality2} for paired end, ignoring the quality header?
This still has the same advantages as using paste, but without including unneeded information in the quality header. With ReadTools, it is ensured that the read name will contain the information required for processing (Illumina-encoded name, using # to separate the barcode and / to separate pair-end information).
If this is good enough for you, I can refactor the code in the distmap output to detect the .tfq extension and write that format; otherwise, it will output the current one. Another option is an extension-independent format and an advanced parameter. Let me know what is better for your case...
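For comparison, the proposed layout without the quality header can also be produced from the shell. A minimal sketch, assuming strict four-line FASTQ records (no wrapped sequences) and, for paired-end data, interleaved input; the function names are hypothetical:

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not ReadTools code.

# Single-end: name, sequence, a bare "+", quality.
fastq_to_min_tfq_se() {
    tr '\t' ' ' | paste - - - - |
        awk -F'\t' -v OFS='\t' '{ print $1, $2, "+", $4 }'
}

# Paired-end (interleaved input): both mates on one line.
fastq_to_min_tfq_pe() {
    tr '\t' ' ' | paste - - - - - - - - |
        awk -F'\t' -v OFS='\t' '{ print $1, $2, "+", $4, $5, $6, "+", $8 }'
}
```

The leading `tr` replaces any tabs in the headers with spaces, so that every record yields a fixed number of tab-separated fields.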
Not at all, since this is only used to construct input files for the mappers. But what I did instead, after you warned me about the tabs, was to change all tabs to spaces before pasting, i.e.,
... | tr '\t' ' ' | paste - - - - - - - - ...
It is not very difficult to produce the format you suggest in a shell one-liner, but I think this is even simpler.
Cheers Rupert
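The effect of the tr step can be checked by counting fields per line. A small sketch, assuming interleaved paired-end input with four-line records; `pack_tfq_pe` and `check_tfq_pe` are hypothetical names:

```shell
#!/bin/sh
# Hypothetical helpers for illustration.

# Pack interleaved paired-end FASTQ into one tab-separated line per pair,
# replacing any tabs inside the original lines with spaces first.
pack_tfq_pe() {
    tr '\t' ' ' | paste - - - - - - - -
}

# Exit non-zero if any line does not have exactly 8 tab-separated fields.
check_tfq_pe() {
    awk -F'\t' 'BEGIN { bad = 0 } NF != 8 { bad = 1 } END { exit bad }'
}
```

Without the tr step, a header containing a tab produces a ninth field and the check fails.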
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/magicDGS/ReadTools/issues/404#issuecomment-366511693, or mute the thread https://github.com/notifications/unsubscribe-auth/Ad_FfJFxME4DyPk-i8qyEqIO6TGdQQCmks5tWBPigaJpZM4SFGha .
But I am not using the shell in ReadTools, only pure Java and my framework. What ReadTools does internally for every SAM/FASTQ file is the following:
- Convert to a consistent object: read name without spaces and with barcode/pair-end information stripped, BC tag for the barcode extracted from the read name, CO tag for quality header comments (only FASTQ).
- For the distmap format, it just grabs the needed information: read name, sequence and quality.
- For the FASTQ output, it converts the header back to the Illumina format: appends BC and pair-end information to the read name, and regenerates the quality header from the CO tag.
I think that, because the main idea of using paste is to speed things up, it is better to just generate a "minimal-tfq" from the internal object, without the quality header, which isn't used at all by any mapper (as far as I know). This will also help to reduce the file size if not compressed (or even if compressed). I can imagine huge files after converting a BAM file whose CO tags are large but unrelated to the quality header...
ReadTools already moves the quality header to the comment tag, and thus it is easier not to include the quality header.
Sure, do it like this then. I just like the idea of having a shell "reference implementation", but I think I can replicate the format with a tiny sed script.
Cheers Rupert
I believe that most of the FASTQ files handled here do not contain any quality header; and anyway, the reference implementation can still be the tr | paste pipe, because it shouldn't change the result (and if it does, it means that we need to include it in ReadTools).
Sooner or later I would also like to move from Ram's format where uninformative fastq lines are removed before putting the records on one line to my more straight-forward "tabbed fastq" format where nothing is removed and records are jammed together like this (for paired-end):
samtools fastq $b | paste - - - - - - - - | hadoop fs -put - ${b%.bam}.tfq
(or ... | mc4 | hadoop fs ... ....tfq.mc4 with mc4 compression), which is much less hassle to handle with command-line tools.
While ReadTools support for this is not so urgent, maybe it would not be so difficult to already include an option to produce this format?
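Going the other way is equally simple with standard tools. A minimal sketch, assuming paired-end tabbed records with eight fields and no tabs inside any field; the function names and output file arguments are hypothetical:

```shell
#!/bin/sh
# Hypothetical helpers for illustration.

# Turn tabbed records back into plain (interleaved) FASTQ lines.
unpack_tfq() {
    tr '\t' '\n'
}

# Split paired-end tabbed records into two mate files.
# $1 = output FASTQ for mate 1, $2 = output FASTQ for mate 2
split_tfq_pe() {
    awk -F'\t' -v m1="$1" -v m2="$2" '{
        printf "%s\n%s\n%s\n%s\n", $1, $2, $3, $4 > m1
        printf "%s\n%s\n%s\n%s\n", $5, $6, $7, $8 > m2
    }'
}
```

Either form could then be handed to a mapper that expects ordinary FASTQ.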