katholt / RedDog


Removing individuals from Merge output folder #68

Closed spencer411 closed 4 years ago

spencer411 commented 4 years ago

First, let me say we have been using the pipeline and it is a fantastic resource, so thank you for making it available!

We have a large dataset, with over 500 samples at this point, and we are having a couple of minor issues, all of which are somewhat related to merging new runs.

The first is that it is starting to time out on us (specifically at the FastTree chromosomal tree building stage). I see that there is an option to turn this off after hitting 500 samples. Is that because of this problem, or should I assume something else is going on here? This happens specifically when it is trying to build a new tree while merging a small set of 5 isolates into a ~500-isolate dataset.

The second is that there were several instances where merge jobs were inadvertently killed while running. This has led to a problem where we get a long list of cns_warning.txt files in the output folder that read "No consensus sequence of replicon AE017336 for isolate WT-427, WT-427 removed from further allelic analysis". Is there a way to remove these individuals from the output folder completely, so that we no longer get these .txt files and can re-run them with the same names? As of now, a rerun with the same isolate name will immediately quit because it says they are duplicates.

A third question (related to the second): can we remove individuals we no longer want to include in the analysis from the original folder and continue to merge to it? If so, what is the best way to do that? Is it as simple as removing them from the sequence list, as well as the bam and vcf folders (i.e. are all the other output files written from scratch after each merge)?

That is it, and thanks so much for your time.

d-j-e commented 4 years ago

Glad to hear you like it - RedDog does have quirks that I'd like to rub out, but other matters distract me atm (a PhD for one...). You might be interested in the project, SNPPar, for mapping SNPs back to your tree...

Which brings us to your first query - FastTree is a good approximate ML method for obtaining trees, but it's not really designed for larger data sets, and can get a bit resource hungry. Whilst getting a tree is nice, we decided the pipeline finishing was the priority, especially with larger datasets (we have a few datasets > 5K isolates...). The tree step can always be done later (after qc; filtering repeat regions etc.). So RedDog automatically turns off the tree step to get things done. You can override this by changing force_tree = False to True in your config file. Just keep in mind that if your set gets too large, you'll probably need to give more resources to the makeTree step (both time and memory) or the pipeline will fall over, just before it finishes...
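For reference, the override described above is a one-line change in the config file. This is only a fragment; the rest of the config is left as it is, and the comment wording is an illustration rather than text from RedDog itself:

```python
# Excerpt from the RedDog config file.
# Only force_tree is taken from the discussion above; every other
# setting in the config stays untouched.
force_tree = True  # was False; True builds the tree even for large datasets
```

If the tree step then runs out of time or memory, the makeTree resources mentioned above are the next thing to raise.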

My first advice about merge runs is to always (and I can't stress this enough) ALWAYS do a merge run into a copy of your previous run - there are certain points in the merge run where, if there is a failure, you can lose the lot! (Unfortunately, speaking from experience)
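A minimal sketch of taking that copy before a merge, using only the Python standard library. The function name and the folder paths are hypothetical; point the merge at the copy, not at the original:

```python
import shutil
from pathlib import Path


def copy_run_for_merge(previous_run, merge_target):
    """Copy a finished run's output folder so the merge can go into the copy.

    Both arguments are hypothetical paths chosen by the user. The original
    folder is never modified, so it survives a failed merge.
    """
    src = Path(previous_run)
    dst = Path(merge_target)
    if not src.is_dir():
        raise FileNotFoundError(f"previous run folder not found: {src}")
    if dst.exists():
        # Refuse to overwrite: an existing target may be someone's only copy.
        raise FileExistsError(f"refusing to overwrite existing folder: {dst}")
    shutil.copytree(src, dst)
    return dst
```

For example, `copy_run_for_merge("run_v1", "run_v1_merge")` would leave `run_v1` intact as a fallback.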

The solution for both your second and third query is the same - you are correct about the sequence list, bam and vcf. There are three other outputs that will need modifying: RepStats, AllStats and AllRepGeneCover. These three are not regenerated each time; the new results are added to them. AllStats has two versions - AllStats.txt is the one you need to modify (to remove an isolate, just remove the appropriate row); the other, AllStats_user.txt, is regenerated each time (from the merged 'AllStats.txt').
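Removing the rows can be scripted rather than done by hand. The sketch below assumes the stats file is tab-delimited with the isolate name in the first column; that layout is an assumption, so check it against your own AllStats.txt first. The function name is hypothetical, and it writes to a new file so the original stays untouched:

```python
import csv


def drop_isolate_rows(in_path, out_path, isolates, name_column=0):
    """Copy a tab-delimited table, skipping rows for the given isolates.

    Assumes (not confirmed by RedDog) that the isolate name sits in
    column `name_column`. Returns the number of rows removed.
    """
    to_drop = set(isolates)
    removed = 0
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.reader(src, delimiter="\t")
        writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
        for row in reader:
            if row and row[name_column] in to_drop:
                removed += 1
                continue  # skip this isolate's row
            writer.writerow(row)
    return removed
```

Something like `drop_isolate_rows("AllStats.txt", "AllStats.edited.txt", ["WT-427"])` would drop that isolate; compare the two files before replacing the original.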

Hope this helps...

spencer411 commented 4 years ago

Okay thanks, just confused about "AllRepGeneCover" as I don't see that in the output folder. Are you referring to the CoverMatrix.csv? See attached screenshot of output folder...

Screen Shot 2020-02-07 at 12 53 14 PM

Thanks again for the help!

d-j-e commented 4 years ago

Sorry about that, I was a bit distracted by my 24-month PhD report at the time (waiting on the outcome as I type).

Both the cover and depth matrix files will have to be edited to remove the isolates you want to rerun with the same name. You will have to remove the correct column(s) - Excel is the easiest way, but be careful of the line endings afterwards. Fortunately, there are not separate files for each replicon. RedDog would run successfully without these being changed, but the gene summary and presence/absence counts would be affected.
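If you would rather avoid Excel and its line-ending changes, the column removal can be sketched in Python. This assumes the matrix files are comma-separated with a header row whose column names are the isolate names, and that Unix line endings are what you want; both are assumptions to check against your own files. The function name is hypothetical:

```python
import csv


def drop_isolate_columns(in_path, out_path, isolates):
    """Copy a CSV matrix, omitting the columns whose header names an isolate.

    Assumes (not confirmed by RedDog) a header row of isolate names.
    Writes "\n" line endings. Returns the number of columns removed.
    """
    to_drop = set(isolates)
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst, lineterminator="\n")
        header = next(reader, None)
        if header is None:
            return 0  # empty input file: nothing to do
        keep = [i for i, name in enumerate(header) if name not in to_drop]
        writer.writerow([header[i] for i in keep])
        for row in reader:
            # Guard against short rows so a ragged line does not raise.
            writer.writerow([row[i] for i in keep if i < len(row)])
    return len(header) - len(keep)
```

A return value of 0 means no header matched, which usually points to a name or delimiter mismatch rather than a clean file.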