rvalieris / LCS

9 stars 4 forks source link

Unclear how to generate a markers table from UCSC #6

Open hoelzer opened 2 years ago

hoelzer commented 2 years ago

Hi, thanks for providing such a pipeline!

However, it is unclear to me how to generate a new table using sequences tree generated by UCSC as a source based on the descriptions in the README. E.g., how should the snakemake pipeline be started then? This seems to be only explained for the Pangolin branch.

hoelzer commented 2 years ago

I was then able to run

snakemake --config markers=ucsc dataset=own-ucsc -j1 repo

just assuming that this is correct. Which first failed with

Cloning into 'data/pango-designation'...
warning: Could not find remote branch v1.2.60 to clone.
fatal: Remote branch v1.2.60 not found in upstream origin

thus I updated

PANGO_DESIGNATIONS_VERSION='v1.2.126'

(although I thought I can simply rely on the UCSC tree and skipping pangolin)

but then the pipeline just cloned the pangodesignation git and finished.

rvalieris commented 2 years ago

yeah sorry, it not very clear, but you don't need to run the repo rule when you are using ucsc markers, you can just skip this line and run the rest of the pipeline, just dont forget to update the PB_VERSION with a recent date.

hoelzer commented 2 years ago

Ah! So basically I update the PB_VERSION variable and run the piepline and then it wil automatically download and configure the marker table based on UCSC? I think it would be good to clarify this a bit more in the README.md

rvalieris commented 2 years ago

yes, set markers=ucsc and update PB_VERSION and you should be good to go

I added this intermediary repo rule for pango because the user needs to provide the gisaid sequences manually before running the pipeline proper, but this isn't an issue with the ucsc data.

MarieLataretu commented 2 years ago

Is it possible to only generate an updated table (and don't run the whole pipeline)? It would save resources when it can be used like a pre-generated-marker-tables afterwards.

rvalieris commented 2 years ago

hello -- yes, if you only want to generate the marker table for ucsc you could run it like this: snakemake --config dataset=pool markers=ucsc -- ucsc_gather_tables

if the marker table at outputs/variants_table/ucsc-markers-table.tsv already exists it won't run it again, but you can also force snakemake to re-run all steps anyway like this: snakemake --config dataset=pool markers=ucsc -F -- ucsc_gather_tables

MarieLataretu commented 2 years ago

Cool, thanks! I'm trying it out right now.

To change the PB_VERSION, you have to edit the config rule, right? At least this didn't work for me: snakemake --cores 15 --resources mem_gb=\$mem --config PB_VERSION='2022-05-01' dataset=pool markers=ucsc -- ucsc_gather_table

Just to create the table, pool can be empty, or?

And how much memory would you suggest as a minimum? My jobs on our HPC failed because of matUtils extract (seg faults or out of memory with 20 GB and 15 cores).

rvalieris commented 2 years ago

To change the PB_VERSION, you have to edit the config rule, right?

yes, it wont work with --config unfortunately.

the pool shouldn't matter at this stage, you can pass any string there.

And how much memory would you suggest as a minimum? My jobs on our HPC failed because of matUtils extract (seg faults or out of memory with 20 GB and 15 cores).

if you have the memory to spare try on a machine with at least 100G, otherwise you will need to downsample, because some variants, like Alpha, have a lot of sequences (>500k)

here is an example for how you could downsample to 10k sequences per variant:

diff --git a/rules/ucsc-matrix-pipe.py b/rules/ucsc-matrix-pipe.py
index 4892c2c..c856128 100644
--- a/rules/ucsc-matrix-pipe.py
+++ b/rules/ucsc-matrix-pipe.py
@@ -21,6 +21,7 @@ rule sample_list:
                out = open(output[0], 'wt')
                ls = params.lin.split(",")
                rows = h.loc[h['pangolin_lineage'].isin(ls)]
+               rows = rows.sample(n=10000)
                out.write("\n".join(rows['strain'])+"\n")
LauraVP1994 commented 1 year ago

I have to agree that it is not very clearly explained in the README.md what you need to change exactly and which command (and input) should be given....