Closed by robmaz 6 years ago
You're right, ultimately we don't want the mappers in the distmap distribution at all, so I am basically in camp #2. In fact I plan to put a similar libexec/ folder into the HDFS and have distmap pick default mappers from there unless binaries are specified on the command line, which is, as you say, the obvious thing that should happen. But this requires a bit of additional logic (and a minor API change to make the path to the mapper optional) that is not yet there and will take more time, so I don't want to do it now. But I don't want to remove them either: people may currently use a certain mapper version in their project and want to keep using it for consistency, and then they can't find the source anymore, or can't compile it, and bother me with it. Pointing them to the new folder location is as much effort as I want to invest in supporting that. This is why the folders need to remain for now.
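The planned fallback (use the mapper given on the command line, otherwise pick a default from libexec/) could look roughly like the sketch below. It is only an illustration in Python, not distmap code: the function name `resolve_mapper`, the directory `/distmap/libexec`, and the mapper name `bwa` are assumptions made for the example.

```python
from pathlib import PurePosixPath
from typing import Optional


def resolve_mapper(cli_path: Optional[str],
                   mapper_name: str,
                   libexec_dir: str) -> PurePosixPath:
    """Return the mapper path to use.

    An explicit path from the command line always wins, so projects
    can stay pinned to a specific mapper version. Otherwise fall back
    to a default binary under libexec_dir (hypothetical location).
    PurePosixPath is used because the default may live in HDFS, so no
    local filesystem checks are attempted here.
    """
    if cli_path:
        return PurePosixPath(cli_path)
    return PurePosixPath(libexec_dir) / mapper_name


# No mapper given on the command line: use the default location.
default = resolve_mapper(None, "bwa", "/distmap/libexec")
# Mapper given explicitly: keep the user's choice.
pinned = resolve_mapper("/home/user/bin/bwa-old", "bwa", "/distmap/libexec")
```

Keeping the explicit path as the first choice means existing command lines would continue to work unchanged once the mapper path becomes optional.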
One big reason why I want to rename all folders now (in addition to making the layout more FHS-conforming and thus installer-friendly) is to make sure I catch all occurrences of calls to external scripts, jars, or binaries. It is a great way to check that no "hidden" calls remain and that I have fully understood the logic of what distmap currently does.
What I want to do after this renaming is finish the Homebrew formula; then I can install it and try the whole pipeline on both the Mac and the Linux cluster with my Wolbachia data. This will probably reveal all remaining issues, which I will then fix as they pop up until this all works again. Then we can move 3.1 into beta and I will put it on the clusters and announce it to popgen to try out. Only then should we introduce actual new features.
2017-12-05 10:43 GMT+01:00 Daniel Gómez-Sánchez notifications@github.com:
Assigned #45 https://github.com/robmaz/distmap/pull/45 to @robmaz https://github.com/robmaz.
@robmaz - Ok, now I see your plan for distmap. But I wonder about two different issues that you brought up here:
Anyway, it is up to you whether to merge this PR. It looks good, but I will remove the source and keep only the binaries if that is possible.
Supporting the new ReadTools fully and without inconsistencies is not a new feature; we already had a PR about that. It did not cover everything that needed to be done, but I will iron out the remaining issues during the testing.
However, I would like to keep open the possibility of cluster trimming in principle. Trimming is inherently very parallelizable, and if you trim 100 GB of BAMs locally, that also takes a couple of hours, I think, whereas it would be a 10-minute job on the cluster. Didn't you think about making a smaller ReadTools that we could also use on the cluster?
I'll remove src stuff and merge this.
Making a smaller ReadTools requires leaving some Java dependencies out of the package, so there is no plan to do so yet (at least until GATK separates its artifacts to make picking dependencies easier; I've already opened an issue with them).
In addition, supporting trimming on the cluster might speed things up through parallelization, but:
I believe that trimming on the cluster produces more nightmares than advantages, and DistMap is mainly designed for mapping (trimming was an additional feature required in the lab because it was really slow locally with the Perl script). If we want to go in the direction of supporting MapReduce jobs for all the steps in our pipeline, I would prefer to implement a distmap-like functionality directly in ReadTools, or a sparkified version of it. For example, a MapReduce class for trimming would be quite easy to implement programmatically for any kind of input (distmap/FASTQ/SAM/BAM/CRAM), although paired-end data would be trickier except for the distmap input.
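To illustrate why a trimming step maps well onto a map-only job, here is a minimal Python sketch in which each read is trimmed independently of all others. It is not the ReadTools trimming algorithm: the simple 3' quality cutoff, the threshold of 20, the Phred+33 offset, and the function names are all assumptions for the example. It also handles single-end records only, which sidesteps the paired-end difficulty mentioned above.

```python
from typing import Iterable, Iterator, Tuple

Read = Tuple[str, str, str]  # (name, sequence, quality string)


def trim_record(name: str, seq: str, qual: str,
                min_qual: int = 20, offset: int = 33) -> Read:
    """Cut bases off the 3' end while their quality is below min_qual."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_qual:
        end -= 1
    return name, seq[:end], qual[:end]


def map_trim(records: Iterable[Read],
             min_qual: int = 20, min_len: int = 1) -> Iterator[Read]:
    """Map step: trim every read on its own and drop reads that end up
    shorter than min_len. No reduce step is needed because no read
    depends on any other read."""
    for name, seq, qual in records:
        trimmed = trim_record(name, seq, qual, min_qual)
        if len(trimmed[1]) >= min_len:
            yield trimmed


reads = [("r1", "ACGTAC", "IIII##"), ("r2", "ACGT", "####")]
kept = list(map_trim(reads))
```

Because the map function has no shared state, the input can be split at any record boundary and processed on as many nodes as are available.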
Finally, the big move. After removing some unused archives from Linux_executables/, rename Linux_executables/ to libexec/linux/ and executables/ to libexec/macos/. This may introduce a temporary issue with not finding samtools, which I will fix when it manifests (maybe by dropping samtools altogether as suggested in #9, also to be seen in connection with #38). The mappers have no default location; their path needs to be specified on the command line anyway.
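One way to catch leftover references to the old folder names after such a rename is a small scan over the source text. The sketch below is a hypothetical helper in Python, not part of distmap; only the two directory names come from the description above, and the sample input lines are invented for illustration.

```python
import re
from typing import List, Tuple

# Old directory names from the rename described above. The lookbehind
# keeps a bare "executables/" from matching inside a longer identifier.
OLD_DIR = re.compile(r"(?<![A-Za-z_])(Linux_executables|executables)/")


def find_stale_references(text: str) -> List[Tuple[int, str]]:
    """Return (line number, old directory name) for every reference
    to a pre-rename folder found in text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in OLD_DIR.finditer(line):
            hits.append((lineno, match.group(1)))
    return hits


# Invented sample input: two stale paths and one already-renamed path.
sample = (
    'system("$dir/Linux_executables/samtools");\n'
    'my $tool = "executables/samtools";\n'
    'my $new = "libexec/linux/samtools";\n'
)
stale = find_stale_references(sample)
```

Running such a check over every script in the repository would list each call site that still points at the old layout, including a samtools lookup like the one anticipated above.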