How to combine NCBI SRA with local data on samplesheet.csv?

Erythroxylum commented 9 months ago

Hello, I have some samples from the SRA that I would like to include alongside my local fastqs. I have downloaded the metadata sample sheet from the SRA as directed, but how do I combine it with the samplesheet.csv created for the local files? I have attached a version of the sample sheet (test_SRA_samplesheet.csv) in which I kept only the SRA metadata columns that match the samplesheet columns created by your Python script, so the fq1 and fq2 columns for the SRA samples are 'NaN'. Should this work? Thanks for the help!
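
For illustration, the kind of merge described above could be sketched as follows. This is only a sketch: the column names in COLUMNS are an assumed layout (only fq1 and fq2 are named in this thread), the function name combine_sheets is hypothetical, and whether the pipeline accepts the resulting sheet depends on the snpArcher version in use.

```python
import csv
import io

# Assumed column layout; check it against the header of your own samplesheet.csv.
COLUMNS = ["BioSample", "LibraryName", "Run", "Organism",
           "refGenome", "BioProject", "fq1", "fq2"]


def combine_sheets(local_csv: str, sra_csv: str) -> str:
    """Concatenate local and SRA rows under one header.

    Columns missing from either input are left empty, and SRA rows always
    get empty fq1/fq2 because they have no local fastq files.
    """
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()

    for row in csv.DictReader(io.StringIO(local_csv)):
        writer.writerow({col: row.get(col) or "" for col in COLUMNS})

    for row in csv.DictReader(io.StringIO(sra_csv)):
        merged = {col: row.get(col) or "" for col in COLUMNS}
        merged["fq1"] = merged["fq2"] = ""  # empty rather than the string 'NaN'
        writer.writerow(merged)

    return out.getvalue()
```

Extra columns in the SRA metadata are dropped, and any column the SRA export lacks (for example a reference genome, under this assumed layout) is left blank and would need to be filled in by hand.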

cademirch commented 9 months ago

Currently it's not possible. It has been on my mind to add for some time though. It should be pretty straightforward, so I'll take a crack at it today.

Erythroxylum commented 9 months ago

Awesome! That would be easier than downloading everything. Thanks.

cademirch commented 9 months ago

@Erythroxylum, I added this on the branch local_and_sra. I'm working on PR #146 now, but you can check out that branch and test it. Let me know if you run into issues.

Erythroxylum commented 9 months ago

Hi @cademirch, the pipeline ran until sort_gatherVcfs. When I ran the main branch with the same test batch of fastqs but without the SRA sample, snpArcher completed. The error and log files are attached. Here is the error from the log:

Checking the headers and starting positions of 142 files
[E::bgzf_flush] File write failed (wrong size)
[buf_flush] Error: cannot write to /tmp/00114.bcf
Cleaning

Thanks for your help! Attachments: sratest20_log.txt, sraerr19.txt

cademirch commented 9 months ago

@Erythroxylum I'm not sure about this, but it looks like the tmp dir may be running out of space. Tagging @tsackton since he's familiar with this cluster.

tsackton commented 9 months ago

@Erythroxylum Is big temp set to something like "./tmp" in your config? If not, that is an easy place to start.
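
To check whether space is the problem, free space in the system temp directory can be compared with a candidate location such as the working directory. The following is a minimal sketch using only the Python standard library; the function names free_gib and report are hypothetical and not part of snpArcher.

```python
import shutil
import tempfile
from pathlib import Path


def free_gib(path: str) -> float:
    """Return free space, in GiB, on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / 1024**3


def report(candidate: str = ".") -> dict:
    """Compare free space in the system temp dir with a candidate location."""
    system_tmp = tempfile.gettempdir()
    return {
        system_tmp: free_gib(system_tmp),
        str(Path(candidate).resolve()): free_gib(candidate),
    }


if __name__ == "__main__":
    for location, gib in report().items():
        print(f"{location}: {gib:.1f} GiB free")
```

Run it from the pipeline's working directory. If the system temp directory has far less free space than the working directory, pointing the temp setting at a local path is a reasonable first fix.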

Erythroxylum commented 8 months ago

Hello, big temp was not set to anything, so I set it as above, but now there is an error:

Traceback (most recent call last):
  File "/n/holyscratch01/davis_lab/dwhite/snpArcher-shared-testsra-fastq/snpArcher/./profiles/slurm/slurm-submit.py", line 7, in <module>
    from snakemake.utils import read_job_properties
ModuleNotFoundError: No module named 'snakemake'

Did changing the tmp setting cause this? run_pipeline.sh and the log file are attached. When I replicate the commands in run_pipeline.sh in an interactive session, snakemake --version reports 7.28.3. Attachments: sraerr-nomodulesnakemake.txt, run_pipeline.sh.txt

tsackton commented 8 months ago

I suspect this is an issue with conflicting Python versions somewhere. Not sure why it wasn't a problem before - can you check what version of python you have in your Snakemake environment, and also what the default Python version is (e.g. after module load python but before you activate the Snakemake environment)?
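
One way to see which interpreter is actually in use, and whether it can find snakemake, is a short check like the one below. It is a sketch using only the standard library; the function name describe_interpreter is hypothetical. It should be run with the same python command that the failing script would pick up.

```python
import importlib.util
import sys


def describe_interpreter(module: str = "snakemake") -> dict:
    """Report which Python is running and whether it can import `module`."""
    spec = importlib.util.find_spec(module)
    return {
        "executable": sys.executable,
        "version": ".".join(map(str, sys.version_info[:3])),
        "module_found": spec is not None,
        "module_location": spec.origin if spec is not None else None,
    }


if __name__ == "__main__":
    for key, value in describe_interpreter().items():
        print(f"{key}: {value}")
```

Running it once after module load python and again after activating the Snakemake environment shows whether the two steps resolve to different interpreters, and which of them can see the snakemake package.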

Erythroxylum commented 8 months ago

module load python
python --version
Python 3.10.12

conda activate snakemake
python --version
Python 3.11.4

tsackton commented 8 months ago

I don't fully understand why this error is occurring for you now, only after changing the big temp line in the config. My best guess is this is related, somehow, to using the python cluster module, but I am not really sure.

I would suggest that the easiest thing to try is to start with a fresh install of mamba, using miniforge3: https://github.com/conda-forge/miniforge (use Miniforge3-Linux-x86_64). Then, you might need to delete your existing conda environments. You can recreate your snpArcher environment with mamba create -c conda-forge -c bioconda -n snparcher snakemake=7.32.4 (note snpArcher doesn't work with snakemake 8.0 yet).

Then, ideally, the first line of your run_pipeline.sh script can simply be mamba activate snparcher (the environment created above), and you can remove the module load and the existing conda lines.

I'm not positive that will work, because this is a strange error, but it feels like some kind of python version thing which is likely a conda issue, so this is probably the best troubleshooting place to start. Please keep us posted if you have any issues.

Erythroxylum commented 8 months ago

Fresh install of mamba seems to have worked. @cademirch I went to clone the branch again and it said the branch has been removed. What is the status now?

git clone -b local_and_sra --single-branch https://github.com/harvardinformatics/snpArcher.git

Cloning into 'snpArcher'...
warning: Could not find remote branch local_and_sra to clone.
fatal: Remote branch local_and_sra not found in upstream origin

tsackton commented 8 months ago

We merged that branch a few weeks ago. You should be fine just using main.
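
Feature branches are often deleted after merging, so it can help to ask the remote which branches it still has before cloning one by name. The sketch below wraps git ls-remote --heads; the helper names heads_from_ls_remote and remote_has_branch are hypothetical, and the second function needs git installed and network access.

```python
import subprocess


def heads_from_ls_remote(output: str) -> set:
    """Parse `git ls-remote --heads` output into a set of branch names."""
    prefix = "refs/heads/"
    heads = set()
    for line in output.splitlines():
        parts = line.split("\t")  # each line is "<sha>\t<ref>"
        if len(parts) == 2 and parts[1].startswith(prefix):
            heads.add(parts[1][len(prefix):])
    return heads


def remote_has_branch(url: str, branch: str) -> bool:
    """Ask the remote which branches it has (needs git and network access)."""
    result = subprocess.run(
        ["git", "ls-remote", "--heads", url],
        capture_output=True, text=True, check=True,
    )
    return branch in heads_from_ls_remote(result.stdout)
```

For example, remote_has_branch("https://github.com/harvardinformatics/snpArcher.git", "local_and_sra") could be checked first, falling back to main when it returns False.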

Erythroxylum commented 8 months ago

OK, well you can close this then. Thanks so much for your prompt attention!

harvardinformatics / snpArcher

How to combine NCBI SRA with local data on samplesheet.csv? #145