mgharvey / seqcap_pop

A pipeline for de novo assembly of population-level sequence capture datasets.

Regex needed in extract_uce_bypass #2

Open ericopolo opened 7 years ago

ericopolo commented 7 years ago

Hi, Dr. Harvey!

I think this issue is actually related to the Tutorial in README.md. This is my first try at your pipeline, and when I ran the script extract_uce_bypass I noticed that the number of unique contigs recognized was about tenfold lower than what I get from Faircloth's phyluce_assembly_match_contigs_to_probes script, using the same contigs and probes files (your 4715-probes.. file, by the way). Since the phyluce script uses a regex to find the desired loci in the probes file, I figured out that passing a regex to extract_uce_bypass does the trick. But I succeeded only after realizing that I also had to supply the replacement argument --repl, with the correct syntax, since I wasn't yet familiar with how to pass references to the re.sub function.
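For anyone else unfamiliar with references in re.sub, here is a minimal sketch of the idea. It assumes probe names of the form uce-<locus>_p<probe> and a pattern along the lines of the one phyluce uses; the headers, the pattern, and the helper name locus_name are illustrative assumptions, not values taken from extract_uce_bypass or from the probes file in this thread. The replacement string `\1` refers back to the first capture group, so every probe of a locus collapses to the same locus name.

```python
import re

# Assumed probe-name pattern (illustrative): "uce-<locus>_p<probe>..."
PROBE_REGEX = r"^(uce-\d+)(?:_p\d+.*)"
# Replacement string: "\1" is a reference to capture group 1 (the locus name)
REPL = r"\1"


def locus_name(probe_header, regex=PROBE_REGEX, repl=REPL):
    """Collapse a probe header to its locus name using re.sub.

    Hypothetical helper for illustration only.
    """
    return re.sub(regex, repl, probe_header)


# Made-up headers: two probes for one locus, one probe for another
headers = ["uce-1005_p1", "uce-1005_p2", "uce-2349_p1 |source:example"]

# Unique loci after stripping the probe suffix
loci = sorted({locus_name(h) for h in headers})
print(loci)
```

When a pattern and replacement like these are passed on the command line, wrapping each in single quotes keeps the shell from interpreting the backslashes and parentheses. Names that do not match the pattern are returned unchanged by re.sub, so a pattern that does not fit the probe names silently leaves them as they are.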

I'm not sure if that was supposed to be obvious (or even if there is another, easier solution for that), but I think adding this information to the tutorial would prevent some future user (as inexperienced as myself, at least) from spending time figuring out how to make the script work properly. Perhaps just adding the --regex and --repl arguments to the examples should be enough.

I hope this report is useful somehow...

Cheers, and thank you so much for providing this pipeline!

auzzie599 commented 7 years ago

I am also having the problem of extract_uce_bypass_MGH returning fewer contigs matching UCE probes than the Faircloth script phyluce_assembly_match_contigs_to_probes. Can you elaborate on how you solved this problem with the --regex and --repl arguments, including what you specified for those arguments?