LS-BSR pipeline query - Githubissues

ReemaSingh commented 7 years ago

Hello Dr Jason,

I am currently using the LS-BSR pipeline for my analysis, as implemented in your paper "The Effects of Signal Erosion and Core Genome Reduction on the Identification of Diagnostic Markers". I understand the paper and was able to implement the analysis up to the pan- and core-genome step. However, I am still a bit confused about how to implement core genome reduction and signal erosion. This is how I am implementing them:

A) Core Genome Reduction

Random samples -> LS-BSR matrix -> "BSR_to_gene_accumulation-scatter.py" -> accumu-core-replicates.txt -> plotting

B) Signal Erosion

Random samples + core CDS -> LS-BSR matrix -> "BSR_to_gene_accumulation-scatter.py" -> accumu-unique-replicates.txt -> plotting

I would like to ask: am I implementing the analysis of core genome size and signal erosion correctly? Did you plot the replicate values generated by the "BSR_to_gene_accumulation-scatter.py" script? I really want to make sure that I am using this analysis correctly, and I would highly appreciate it if you could shed some light on this.

I look forward to the reply!

Best Regards, Reema,

jasonsahl commented 7 years ago

Hi Reema,

For the core genome reduction, you should be able to reproduce that figure through your general workflow by plotting the core replicates and then also plotting the mean values that the script gives you for each sampling level. For the signal erosion figure, it's more complicated. In this case, I took the core genome that I identified, and screened it against a directory that contained a randomly selected number of non-target genomes that I selected with another script. I then ran the "compare_BSR" script on the new BSR matrix to get the number of unique markers that would serve as diagnostic targets. I did a lot of this processing in parallel and in for loops, but it still took a while to run that many replicates. If you have a similar application and want to know where to get started, I could try to point you to a more detailed workflow, so just let me know.
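The random selection of non-target genomes described above could be sketched as follows. This is a minimal illustration, not the script used for the paper: the function name `subsample_genomes`, the `*.fasta` file pattern, and the directory layout are all assumptions.

```python
import random
import shutil
from pathlib import Path


def subsample_genomes(source_dir, dest_dir, n, seed=None, pattern="*.fasta"):
    """Copy n randomly chosen genome files (without replacement) into dest_dir.

    Hypothetical helper: the file pattern and directory layout are assumptions.
    """
    # Sort first so that a fixed seed always yields the same selection.
    genomes = sorted(Path(source_dir).glob(pattern))
    if n > len(genomes):
        raise ValueError(f"requested {n} genomes but only {len(genomes)} available")

    rng = random.Random(seed)
    chosen = rng.sample(genomes, n)  # sampling without replacement

    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    for path in chosen:
        shutil.copy(path, dest / path.name)

    # Return the selected file names so each replicate can be logged.
    return sorted(path.name for path in chosen)
```

Each replicate would call this with a different seed and destination directory. The resulting directory would then be screened with the core genome, and the new BSR matrix passed to the "compare_BSR" script as described above.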

regards, Jason

ReemaSingh commented 7 years ago

Hello Dr Jason,

Thanks for your reply.

For the core genome reduction analysis, I plotted the core replicates following the same general workflow I described in my previous message. However, instead of the core genome size decreasing as genomes are added, our data show variation. For example, in one iteration the core size keeps decreasing as genomes are added [50-400] and then suddenly increases [see attached figure: NG-Core_size.pdf]. In all of these iterations I also included the analysis of the whole dataset (430 genomes in total), which is supposed to produce the same core genome every time, but this also shows variation. This might be the nature of our dataset [Neisseria gonorrhoeae], but before jumping to that conclusion I would like to make sure I haven't done anything wrong in the analysis. So I would like to ask:

1. As you ran 100 iterations, did you see this kind of variation in any of them?
2. For the selection of the random datasets, your paper says “genome was sampled without replacement from 1 to 400 with 100 iterations at each level. From each subsampling, a set number of genomes were randomly selected with a python script”. Could you please clarify which of these you mean?

   50 = random sample -> 100 datasets -> python scripts -> 50 random samples?
   100 = random sample -> 150 datasets -> python scripts -> 100 random samples?
   OR
   Dataset -> python scripts -> 50/100 datasets?
3. I did it this way: “Dataset -> random sampling without replacement in R -> 50/100 subsets”. Could this be the reason I am seeing variation in my dataset?
4. In Figure 3, is that a scatter dot plot? Sorry if this sounds silly, but I am asking because I tried to generate the same plot and it came out completely wrong with my data, so I stuck with the box plot.
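To make the sampling question concrete, here is a minimal sketch of core genome size under random subsampling without replacement. It is not the LS-BSR implementation: the toy matrix, the function names, and the 0.8 BSR threshold for calling a gene conserved are all assumptions for illustration.

```python
import random

# Toy BSR matrix (hypothetical values): gene -> {genome: BSR}.
GENOMES = ["g1", "g2", "g3", "g4"]
MATRIX = {
    "geneA": {"g1": 1.0, "g2": 0.95, "g3": 0.9, "g4": 0.85},
    "geneB": {"g1": 1.0, "g2": 0.9, "g3": 0.3, "g4": 0.9},
    "geneC": {"g1": 0.9, "g2": 0.1, "g3": 0.9, "g4": 0.9},
}


def core_size(matrix, genomes, threshold=0.8):
    """Count genes whose BSR is >= threshold in every selected genome."""
    return sum(
        all(row[g] >= threshold for g in genomes)
        for row in matrix.values()
    )


def accumulation_curve(matrix, genome_names, levels, iterations, seed=None):
    """Draw `iterations` independent random subsets at each sampling level."""
    rng = random.Random(seed)
    return {
        n: [core_size(matrix, rng.sample(genome_names, n)) for _ in range(iterations)]
        for n in levels
    }


curve = accumulation_curve(MATRIX, GENOMES, levels=[2, 3, 4], iterations=5, seed=1)
```

Two properties follow from this setup. When every genome is sampled from one fixed matrix, each replicate at the full-dataset level gives the same core size by construction. And because the subsets at different levels are drawn independently rather than nested, the core size within a single iteration is not guaranteed to decrease monotonically, although a superset can never have a larger core than a subset it contains.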

For signal erosion, I need to repeat the analysis using the "compare_BSR" script. I would highly appreciate it if you could point me to the detailed workflow. This would be very helpful for making sure that I am implementing the analysis correctly.

Looking forward to the reply.

Best Regards, Reema,

jasonsahl commented 7 years ago

1. No, I've never seen this before. If you want me to check the method, you can share your input files and I can take a closer look.
2. For Figure 3A, I just ran the BSR_to_gene_accumulation scatter script and only took the values for every 50 genomes. For Figure 3B, I selected out a random number of near neighbor genomes with a python script, then ran LS-BSR on the complete directory to see how many diagnostic markers were conserved in the target.
3. I would need a look at your complete workflow in order to see where things might be going wrong.
4. Yes, though a box plot would work fine as well.
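Taking the values for every 50 genomes and averaging them per sampling level could be sketched as below. The input format is an assumption: the sketch supposes each line of the replicates file holds a genome count and a gene count separated by whitespace, which may not match the actual output of the accumulation script. The function name `mean_by_level` is hypothetical.

```python
from collections import defaultdict
from statistics import mean


def mean_by_level(lines, step=50):
    """Return {n_genomes: mean gene count} for levels that are multiples of step.

    Assumes each line looks like "<n_genomes> <count>" (hypothetical format).
    """
    values = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        n_genomes, count = int(fields[0]), float(fields[1])
        if n_genomes % step == 0:
            values[n_genomes].append(count)
    # One mean per retained sampling level, in ascending order of level.
    return {n: mean(v) for n, v in sorted(values.items())}
```

The function accepts any iterable of lines, so an open file handle can be passed directly. The per-level lists collected in `values` are also what a box plot or scatter plot of the replicates would draw from.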

So you're trying to see how many diagnostic markers are removed once you add in a sub-sampled number of non-target genomes? Just trying to clarify exactly what you're doing so I can point you in the right direction.

regards, Jason

ReemaSingh commented 7 years ago

Hello Dr. Jason,

Thanks for answering my questions. Your answers are really very helpful.

> No, I've never seen this before. If you want me to check the method, you can share your input files and I can take a closer look. I would need a look at your complete workflow in order to see where things might be going wrong.

Please give me some time; I will send the complete workflow and input files in my next email.

> So you're trying to see how many diagnostic markers are removed once you add in a sub-sampled number of non-target genomes? Just trying to clarify exactly what you're doing so I can point you in the right direction.

Yes. I am also trying to see the effect of additional genomes on the core genome size in Neisseria gonorrhoeae.

Best Regards, Reema,

jasonsahl / LS-BSR

LS-BSR pipeline query #20