robmaz / distmap

Sequence alignment on Hadoop

Move stuff to libexec #45

Closed robmaz closed 6 years ago

robmaz commented 6 years ago

Finally the big move. After removing some unused archives from Linux_executables/, rename Linux_executables/ to libexec/linux/ and executables/ to libexec/macos/. This may introduce a temporary issue with not finding samtools, which I will fix when it manifests. (Maybe by dropping samtools altogether as suggested in #9, also to be seen in connection with #38.) The mappers have no default location; one needs to be specified on the command line anyway.
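The move itself could be done with `git mv` so file history survives the rename. A sketch, run against a scratch repository so it is safe to try; the directory names are the ones from this PR, the file contents are made up:

```shell
# Sketch of the planned rename, tried out on a throwaway repo.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email distmap@example.org
git config user.name distmap
mkdir -p Linux_executables executables
touch Linux_executables/samtools executables/samtools
git add -A
git commit -qm 'before the move'

# The actual move: git mv keeps the rename visible in history.
mkdir libexec
git mv Linux_executables libexec/linux
git mv executables libexec/macos
git commit -qm 'Move executables to libexec/'
```

On the real repository only the `mkdir`, `git mv`, and `git commit` lines apply.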

robmaz commented 6 years ago

You're right, ultimately we don't want the mappers in the distmap distribution at all, so I am basically in camp #2. In fact, I plan to put a similar libexec/ folder into the HDFS and have distmap pick default mappers from there unless binaries are specified on the command line, which is, as you say, the obvious thing that should happen. But this requires a bit of additional logic (and a minor API change making the path to the mapper optional) that is not there yet and will take more time, so I don't want to do it now. But I don't want to remove them either: people may currently use a certain mapper version in their project and want to keep using it for consistency, and then they can't find the source anymore, or will not be able to compile it, and bother me with it. Pointing them to the new folder location is as much effort as I want to invest in supporting that. This is why the folders need to remain for now.
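The fallback logic described here could be as simple as: use the binary given on the command line if one was given, otherwise fall back to a conventional path inside the HDFS libexec/ folder. A minimal sketch; the `/distmap/libexec` path and the `resolve_mapper` name are assumptions for illustration, not existing distmap code:

```shell
# Hypothetical default-mapper lookup; /distmap/libexec is an assumed
# HDFS location, not something distmap defines today.
DEFAULT_MAPPER_DIR="/distmap/libexec"

resolve_mapper() {
    # $1: mapper path from the command line (may be empty)
    # $2: mapper name, e.g. bwa
    if [ -n "$1" ]; then
        echo "$1"                        # explicit path wins
    else
        echo "$DEFAULT_MAPPER_DIR/$2"    # fall back to the HDFS default
    fi
}

resolve_mapper "" bwa          # prints /distmap/libexec/bwa
resolve_mapper /opt/bwa bwa    # prints /opt/bwa
```

This is the "minor API change" in miniature: the mapper argument becomes optional instead of required.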

One big reason why I want to rename all the folders now (in addition to making the layout more FHS-conforming and thus installer-friendly) is to make sure I catch all occurrences of calling external scripts, jars, or binaries. It is a great way to check that no "hidden" calls remain and that I have fully understood the logic of what distmap currently does.
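One way to double-check that no hidden call sites remain after the rename is a recursive grep for the old directory names. A sketch; a throwaway directory stands in for the repo here, and in practice the grep would be run from the repository root:

```shell
# Audit sketch: any remaining hit on the old directory names is a call
# site that still needs updating. The two files below are made up.
dir=$(mktemp -d)
printf 'samtools=Linux_executables/samtools\n' > "$dir/old.sh"
printf 'samtools=libexec/linux/samtools\n' > "$dir/new.sh"

# -r: recurse, -l: list only file names, -E: extended regex alternation.
grep -rl -E 'Linux_executables|executables/' "$dir"
```

Only `old.sh` is listed; the already-converted `new.sh` no longer matches.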

What I want to do after this renaming is finish the homebrew formula, then I can install it and try the whole pipeline on both Mac and Linux cluster with my Wolbachia data. This will probably reveal all remaining issues, which I will then fix as they pop up until this all works again. Then we can move 3.1 into beta and I will put it on the clusters and announce it to popgen to try out. Only then should we introduce actual new features.

2017-12-05 10:43 GMT+01:00 Daniel Gómez-Sánchez notifications@github.com:

Assigned #45 (https://github.com/robmaz/distmap/pull/45) to @robmaz (https://github.com/robmaz).


magicDGS commented 6 years ago

@robmaz - Ok, now I see your plan for distmap. But I wonder about two different issues that you brought up here:

  1. If the only purpose of keeping binaries here is to support people using old mappers, it is enough to keep only the binaries in the folders. Hosting an outdated bwa source tree here is bad for two reasons: 1) we are distributing source code that we do not support, and 2) it adds a lot of unused files that make no sense to host here. I vote to remove everything that is source and keep only the binaries and needed scripts. This reduces the complexity of the folders and the repository, while still providing the mapper binaries for users who demand outdated software (which is bad anyway).
  2. I do not agree with the "new features" part of your development strategy. First, because I'm not sure what counts as a "new feature" for you. From my point of view, two things should be done before beta to avoid misunderstandings for the users (and support nightmares): 1) trimming in the cluster should be completely removed; 2) the "only-download" and "only-merge" functionality should be deprecated and removed. The tasks for trimming and for only-download/only-merge should log a warning saying that they are deprecated and what they do instead: for 1), trim locally and then upload; for 2), download and merge together, stating that they are not separated anymore. These are not "new features" from my point of view, but consistency issues that should be solved before beta: the CLI interface does not change at all, but under the hood the behaviour is consistent. In both cases the deprecated features give bogus results compared with the expected behaviour (e.g., trimming with an outdated ReadTools even if the provided one is the latest, or producing a different result with the separate download/merge than with the combined one).

Anyway, it's up to you if you want to merge this PR. Looks good, but I would remove the source and keep only the binaries if possible.

robmaz commented 6 years ago

Supporting the new ReadTools fully and without inconsistencies is not a new feature; we already had a PR about that. That was not all of what needed to be done, but I will also iron out the remaining issues during testing.

However, I would like to keep open the possibility of cluster trimming in principle. Trimming is inherently very parallelizable, and if you trim 100 GB of BAMs locally, that also takes a couple of hours, I think, whereas it would be a 10-minute job on the cluster. Didn't you think about making a smaller ReadTools that we could also use on the cluster?

I'll remove src stuff and merge this.
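Removing the sources while keeping the binaries could look like the following sketch; the `*-src` layout is invented for illustration, and the real subfolder names under libexec/ may differ:

```shell
# Sketch: drop vendored source trees, keep the compiled mappers.
# The layout below is made up for illustration.
set -e
repo=$(mktemp -d)
mkdir -p "$repo/libexec/linux/bwa-0.7.12-src"
touch "$repo/libexec/linux/bwa" \
      "$repo/libexec/linux/bwa-0.7.12-src/bwa.c"

# Remove every source directory, leaving the binaries in place.
# -prune stops find from descending into a directory it is deleting.
find "$repo/libexec" -type d -name '*-src' -prune -exec rm -r {} +
```

In the real repository the equivalent would be `git rm -r` on each source directory, so the removal is recorded in history.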

magicDGS commented 6 years ago

Making a smaller ReadTools requires not packing some Java dependencies, so there is no plan to do so yet (at least until GATK separates its artifacts to make picking dependencies easier; I have already opened an issue with them).

In addition, supporting trimming in the cluster might be sped up by parallelization, but:

  1. The cluster is usually for more intensive jobs such as mapping. Using it to save "a couple of hours" (which used to be the case in the cluster with the Perl script, and ReadTools already achieves the same performance), at the cost of reducing the computational power available for mapping for other users, is quite bad.
  2. Most of the people who have been using ReadTools from the beginning are not trimming in the cluster anymore, because the jobs tend to fail a lot; it is safer to trim locally, and it is fast enough for most cases.
  3. The trimming job generates lots of files, and people tend to forget to clean the cluster. This causes the cluster to become too crowded to perform mapping (which also generates quite a lot of files).
  4. Retrieving trimmed files is yet another network task that reduces the performance of mapping and of downloading the results.
  5. The current pipeline is based on an outdated ReadTools or the Perl trimming script (some bug fixes and statistics have changed since). This cannot be solved until ReadTools is small enough, and it might return inconsistent trimmed files (depending on whether trimming is done locally or in the cluster); in addition, it is missing some features of the new ReadTools.
  6. A Sparkified version of ReadTools is on my free-time schedule, but it won't happen soon; it might increase the speed of trimming even locally (using a local Spark runner). In addition, trimming while uploading (see #33) is a better option than parallelizing in the cluster, because the file is parsed anyway for uploading, and saving one of the two parsing passes will also speed things up (feature coming after my committee meeting or Christmas).

I believe that trimming in the cluster produces more nightmares than advantages, and DistMap is mainly designed for mapping (trimming was an additional feature required in the lab because it was really slow locally with the Perl script). If we want to go in the direction of supporting MapReduce jobs for all the steps in our pipeline, I would prefer to implement distmap-like functionality directly in ReadTools, or a Sparkified version of it. For example, a MapReduce class for trimming would be quite easy to implement programmatically for any kind of input (distmap, FASTQ/SAM/BAM/CRAM), although paired-end will be trickier except for the distmap input.