makovalab-psu / DiscoverY

K-mer based classifier for Y-contig identification from Whole Genome Assemblies
MIT License

Better description of how to run the tool would be helpful #4

Open rsharris opened 5 years ago

rsharris commented 5 years ago

The current readme doesn't clearly describe how the user can use the tool to solve the problem it is intended to solve. If I have an assembly, and I want to identify Y-specific contigs, how do I do that?

My best guess, from trying to run the example in the repo, is that the info about which contigs are Y-specific is encoded in the headers of proportion_annotated_contigs.fastq. But this information is not described in the readme. Nor is any step mentioned that will separate Y contigs from the input contigs.

Note that this conclusion is based on the fact that, for me, the output of discoverY.py (proportion_annotated_contigs.fastq) is identical to the input (data/male_contigs.fasta), except that annotation has been added.

The command I ran was the one shown as "a typical run":

python discoverY.py --female_bloom --mode female+male

But it is not clear whether this is the appropriate command to run for the example. Based on the files provided, and after digging through the code to see which options would cause all the provided files to be used, that was the command I came up with. This would be made clearer by having a "tutorial" section in the readme that showed the command to be run.

It would also be helpful to provide, as part of the example, the expected output. As it stands, I don't know whether my run of discoverY.py worked. It's possible that it is not working and that this has fed into my misunderstanding of how it is supposed to be used.

It's also possible that I don't understand what the example is intended to demonstrate.

The discussion of 'best mode' and the Jupyter notebook should clarify whether this step is intended as part of the typical usage pipeline or not. After having a lot of difficulty with the notebook, looking at it in more detail, and realizing that it doesn't read the output from discoverY.py, my best guess is that this is a pre-computing step, to be run before discoverY.py, to guide the choice of threshold. However (assuming that is true), there's nothing that indicates how the resulting threshold would be used.

To recap, as the tool is currently described, quite a bit of insight, digging, and guesswork is required on the part of the user.

rsharris commented 5 years ago

I should add that when I run the example, it reports a proportion of 1.0 for each and every contig. That seems really strange; it would be a strange example. How can I know whether this is expected, or whether it instead indicates that something's wrong with my installation of the program?

rsharris commented 5 years ago

In the current readme, the threshold output by discoverY.py is described as "proportion_shared_with_female". But I think it is really "proportion_NOT_shared_with_female".

Thus values closer to 1 mean a contig is more likely to be from Y.
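To illustrate the separation step that the readme does not describe, here is a minimal sketch of filtering contigs by that proportion. It is not part of DiscoverY. It assumes the annotated output uses FASTA-style headers of the form `>ID length proportion median` (the layout quoted later in this thread), and the function names and the 0.5 threshold are hypothetical.

```python
import io
from typing import Iterator, List, TextIO, Tuple


def read_annotated_contigs(handle: TextIO) -> Iterator[Tuple[str, float, str]]:
    """Yield (contig_id, proportion, sequence) from an annotated FASTA stream.

    Assumed header layout (not confirmed by the readme):
        >ID length proportion_not_shared_with_female median
    """
    header = None
    chunks: List[str] = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield _to_record(header, chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield _to_record(header, chunks)


def _to_record(header: str, chunks: List[str]) -> Tuple[str, float, str]:
    fields = header.split()
    # fields[2] is assumed to be the proportion of k-mers NOT shared with female.
    return fields[0], float(fields[2]), "".join(chunks)


def candidate_y_contigs(handle: TextIO, threshold: float) -> List[str]:
    """Return IDs of contigs whose proportion is at or above the threshold."""
    return [cid for cid, prop, _ in read_annotated_contigs(handle) if prop >= threshold]


if __name__ == "__main__":
    # Made-up records, only to show the call; real input would be the
    # annotated contig file written by discoverY.py.
    demo = io.StringIO(">contigA 10 0.9 5.0\nACGTACGTAC\n>contigB 8 0.1 3.0\nGGCCGGCC\n")
    print(candidate_y_contigs(demo, threshold=0.5))
```

The threshold itself would have to come from somewhere else, for example from the 'best mode' analysis discussed above; this sketch only shows how a chosen value could be applied.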

deilepaita commented 3 years ago

Agree!

Output of DiscoverY in README.md should be corrected to "proportion_NOT_shared_with_female", because after running DiscoverY, the contig file has the following header:

>Sc0000000 7492748 0.012910911534003885 102.0

while the printed results in the terminal are:

No. of contigs seen so far: 1
Current contig ID is : Sc0000000
Median is: 102.0
Total No. of k-mers from this contig: 7492732
No. of k-mers not shared with female: 96738
Proportion is: 0.012910911534003885
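The numbers quoted above can be checked directly. The following sketch recomputes the proportion from the two k-mer counts in the terminal output and compares it with the third field of the header; the variable names are illustrative only, and treating the third header field as the proportion is an assumption based on the matching value.

```python
# Values copied from the header and terminal output quoted in this comment.
header = ">Sc0000000 7492748 0.012910911534003885 102.0"
total_kmers = 7492732          # "Total No. of k-mers from this contig"
not_shared_kmers = 96738       # "No. of k-mers not shared with female"

# Assumption: the third whitespace-separated header field is the proportion.
header_proportion = float(header[1:].split()[2])

# Fraction of the contig's k-mers that are NOT shared with female.
not_shared_fraction = not_shared_kmers / total_kmers
# Fraction that IS shared with female, for comparison.
shared_fraction = 1.0 - not_shared_fraction

header_matches_not_shared = abs(not_shared_fraction - header_proportion) < 1e-9
header_matches_shared = abs(shared_fraction - header_proportion) < 1e-9

print("not shared / total:", not_shared_fraction)
print("header matches 'not shared' fraction:", header_matches_not_shared)
print("header matches 'shared' fraction:", header_matches_shared)
```

The header value agrees with the "not shared" fraction rather than the "shared" one, which supports renaming the field in the readme.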

Another correction that should be made is to the description of how to calculate k-mers from male reads, because it indicates:

cd dependency
ln -s ../data/female.fasta
#make sure the correct reads file is provided to DSK
./run_dsk_Linux.sh r1.fastq 25

which is misleading for new users. Why does the user need to soft link female.fasta into dependency if it is completely unnecessary for running DSK?