sanjaynagi / AmpSeeker

A state-of-the-art snakemake workflow for amplicon sequencing
https://sanjaynagi.github.io/AmpSeeker/

Added kdr origins notebook #102

Closed: EricRLucas closed 5 months ago

EricRLucas commented 5 months ago

Added kdr origins notebook.

Partially addresses #38.

sanjaynagi commented 5 months ago

Hey @EricRLucas, the last thing we are missing (I think) is the file KdrMarkerSnps.csv. Best place to store this is in the resources/ folder.

It also needs to be designated as an input to the kdr-origins rule; that way, if it's not there, snakemake will throw an error before the pipeline is run.

sanjaynagi commented 5 months ago

Hey @EricRLucas.

Probably no need to add an option in the config for the kdr_marker_snps.csv path, as it's not something we would ever want to change - the file will stay in resources/ and should always be there.

The CI runs are now failing, as they run using config files under AmpSeeker/.test/config/. We have to update these two configs every time we make changes to the main config file.

For now, I would just hard code the path resources/Kdr_marker_snps.csv into the input of the kdr-origins rule, and pass that to papermill in the shell block.
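A minimal sketch of that idea in plain Python (not the actual Snakefile): the marker file path is hard-coded, checked up front in the way snakemake checks declared inputs, and then handed to papermill as a parameter. The notebook paths and the parameter name `kdr_marker_snps_path` are hypothetical; `-p name value` is papermill's standard CLI flag for passing a parameter.

```python
from pathlib import Path

# Hard-coded path, as suggested above; the file is expected to always live in resources/.
KDR_MARKER_SNPS = Path("resources/Kdr_marker_snps.csv")


def build_papermill_command(notebook_in, notebook_out, marker_snps=KDR_MARKER_SNPS):
    """Return the papermill command line, failing early if the marker file is missing."""
    marker_snps = Path(marker_snps)
    if not marker_snps.is_file():
        # Mirrors snakemake refusing to start when a declared input does not exist.
        raise FileNotFoundError(f"Missing input file: {marker_snps}")
    return [
        "papermill", str(notebook_in), str(notebook_out),
        "-p", "kdr_marker_snps_path", str(marker_snps),  # hypothetical parameter name
    ]
```

In a real rule, the same path would sit in the rule's `input:` section so the check happens before any job runs.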

EricRLucas commented 5 months ago

@sanjaynagi Cool. There seem to be several configs in .test. Shall I modify config_agvampir.yaml?

sanjaynagi commented 5 months ago

I've fixed it, @EricRLucas. No need to edit the .test configs further, as I've removed that option from the main config.

sanjaynagi commented 5 months ago

Very confusing that it's failing atm. It can't find the file in resources/ag-vampir/, but it's defo there and working locally.

sanjaynagi commented 5 months ago

Ahh, I forgot that the .test/ folder needs its own resources/.

EricRLucas commented 5 months ago

@sanjaynagi Not sure why it's throwing this error. It works fine with the vcf that you gave me to test the notebook on. What vcf does it use in the testing? I can have a look at what happens when I run locally with that vcf.

sanjaynagi commented 5 months ago

@EricRLucas it's all good - I realised that the current test data uses a mini 'reference' genome which is 2L:2,000,000-3,000,000. As a result, all the coordinates are off, and so we don't find any intersecting variants in the notebook when running through CI.

I'm going to resolve this, probably by re-doing how we do the reference for the test data. I'll probably just add a wget command before the CI runs to download the whole AgamP4 FASTA file, so when we align to it, we have proper coordinates. I'll probably do that in another PR, so I'll merge this soon anyway. Thank you once again!
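To illustrate why nothing intersects, here is a small sketch, assuming the mini reference is a 1-based slice of 2L starting at 2,000,000, so that positions called against it are numbered from 1. The marker position used below is illustrative only and is not taken from the marker file; the function names are hypothetical.

```python
REGION_START = 2_000_000  # 1-based start of the mini reference on 2L (assumed)


def to_genome_position(local_pos, region_start=REGION_START):
    """Convert a 1-based position on the sliced reference to a 1-based 2L position."""
    return region_start + local_pos - 1


def intersect(called_positions, marker_positions):
    """Return the positions present in both collections, sorted."""
    return sorted(set(called_positions) & set(marker_positions))


# Illustrative marker position in full 2L coordinates.
markers = [2_422_652]
# The same site as it would be numbered when called against the sliced reference.
called_local = [422_653]

print(intersect(called_local, markers))  # []
print(intersect([to_genome_position(p) for p in called_local], markers))  # [2422652]
```

Aligning to the full reference avoids the conversion altogether, since called positions and marker positions then share one coordinate system.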

EricRLucas commented 5 months ago

@sanjaynagi Cool, though sounds like you'll need more than just the correct coordinates, because your current reference genome doesn't actually include the kdr region, so none of the SNPs will have genotype calls.

sanjaynagi commented 5 months ago

@EricRLucas kdr is within 2-3Mb of 2L? But in any case, this way of downloading the whole reference, mapping to that, and genotyping all target SNPs will be much better.

EricRLucas commented 5 months ago

@sanjaynagi Ah yes, sorry, I read it as 20-30Mb.