Closed mrese001 closed 1 year ago
Hey Lucas!
Apologies for the late reply, but I wanted to make sure I came to you with a more informed question. Regarding my last question, the support team here at the computing cluster department helped me out: I just needed to install a few more packages so I could initialize the environment correctly. Once the environment was set up, we ran the Arabidopsis example and it completed without any issues - good news there!
Now that I have the gist of how the files are organized, I am attempting to run two fastq.gz files against a reference. I ran your script to create the samples.tsv file, but I get this error message: [image: image.png]
Is there anything inside the generate-table.py script that I should change so that it points grenepipe to the right directory and builds the table correctly? Is there some way to generate the table manually?
Below is how I set up my table, but what I need to change in order to run grenepipe is the 5th column, where it's pointing to the example directory. Since I am running these two citrus trees through the pipeline, would I even need the 5th column, given that they are single-end reads? @A01535:218:HLYGKDSX5:1:1101:24758:1000 1:N:0:GACGAGATTA+AGGATAATGT [image: image.png]
Many thanks for taking the time to read this and for your advice,
Mariano
On Fri, Apr 7, 2023 at 9:49 AM Lucas Czech @.***> wrote:
Hi Mariano,
what is the exact command that you ran there? It looks like you are missing something there, such as specifying the --directory option? See here https://github.com/moiexpositoalonsolab/grenepipe/wiki/Quick-Start-and-Full-Example#running-the-pipeline and here https://github.com/moiexpositoalonsolab/grenepipe/wiki/Advanced-Usage#working-directory for more information on that :-)
Cheers Lucas
Hey Mariano,
the image that you tried to attach to your message did not get posted here, see above. Can you please post it again here on GitHub directly, instead of answering the thread via email?
I've also seen the email that you sent me, which seems to be about a related issue. Please use the GitHub issues here for asking these questions, as that will help others with similar problems find solutions as well.
Assuming that both your issue here and your email are about the same problem, I'll answer here.
Is there anything inside of the generate-table.py script that I should change to properly point grenepipe to the proper directory so it can properly develop the table? Is there some way to generate the table manually?
All of this is explained in the wiki: https://github.com/moiexpositoalonsolab/grenepipe/wiki/Setup-and-Usage#samples-table The script accepts the path to where it should look for fastq files, so you don't need to edit it to point to the directory. You can of course also create the table manually, as explained in the wiki as well.
Below is how I setup my table but what I need to change in order to run grenepipe is the 5th column where it's pointing to the example directory. Since I am running these two citrus trees through the pipeline would I even need the 5th column since they are single-end reads?
The column `fq2` should always be present in the table (I might relax that in the future, but it does not seem to be an urgent fix), but the fields can be left empty if there is no paired-end read. If you can share your table here, I can also have a look at what's wrong with it.
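To make this concrete, here is a sketch of what a single-end table could look like. The column names follow the wiki's samples-table description, but please verify against the version you are using; the sample names and paths below are made up.

```shell
# Sketch of a samples.tsv for two single-end samples.
# Column names (sample, unit, platform, fq1, fq2) follow the grenepipe wiki;
# check your version's docs. Fields MUST be separated by tabs, not spaces.
# fq2 is present as a header, but its fields stay empty for single-end data.
printf 'sample\tunit\tplatform\tfq1\tfq2\n'            >  samples.tsv
printf 'S1\t1\tILLUMINA\t/path/to/S1.fastq.gz\t\n'     >> samples.tsv
printf 'S2\t1\tILLUMINA\t/path/to/S2.fastq.gz\t\n'     >> samples.tsv

# Sanity check: every row should report the same number of tab-separated fields.
awk -F'\t' '{print NR": "NF" fields"}' samples.tsv
```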
And from your email:
.. but now that I am running my own samples, it gives me errors, which I will share here. The errors, I think, stem from the way I am organizing my directory to run grenepipe, but I would appreciate some clarity.
The error message you sent me a picture of in that email contains the hint that you are looking for:
```
Expected 3 fields in line 3, saw 4
```
Without having access to your samples table, I can't tell for sure, but my guess is that the number of items per row is not consistent - for example, you forgot a column header, or have tab characters somewhere they should not be. Please check that your table follows the schema.
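A quick way to check this from the terminal - a sketch, using a fabricated `bad.tsv` with a stray space where a tab belongs:

```shell
# Hypothetical broken table: a space sneaks in between "unit" and "fq1".
printf 'sample\tunit fq1\n'            >  bad.tsv
printf 'S1\t1\t/path/to/S1.fastq.gz\n' >> bad.tsv

# With GNU coreutils, cat -A renders tabs as ^I, so stray spaces stand out:
cat -A bad.tsv

# Or flag any row whose tab-separated field count differs from the header's:
awk -F'\t' 'NR==1{n=NF} NF!=n{print "line "NR": "NF" fields, expected "n}' bad.tsv
```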
Hope that helps, and so long Lucas
Hey Lucas!
Again, apologies for the late reply - I've been troubleshooting and had missed an important comment that you made. You were correct in your evaluation of the last error message, and that was indeed my main issue: for some reason I had spaces in the header and file fields instead of the required tabs. The workflow is processing now, and I will update here if I run into any issues.
Many thanks for your patience! Mariano
Hey Lucas,
The workflow did initialize and create some files (trimmed, logs, qc, benchmarks), but at some point the job failed while attempting to map reads. Here is the error from the workflow:
```
Error in rule map_reads:
    jobid: 22
    output: mapped/S1-1.sorted.bam, mapped/S1-1.sorted.done
    log: logs/bwa-mem/S1-1.log (check log file(s) for error message)

RuleException:
CalledProcessError in line 58 of /bigdata/seymourlab/mrese001/grenepipe-0.12.0/rules/mapping-bwa-mem.smk:
Command 'set -euo pipefail; /rhome/mrese001/.conda/envs/grenepipe/bin/python3.7 /bigdata/seymourlab/mrese001/grenepipe-0.12.0/my_citrus_analysis2/.snakemake/scripts/tmpwgqzrpbu.wrapper.py' returned non-zero exit status 1.
  File "/rhome/mrese001/.conda/envs/grenepipe/lib/python3.7/site-packages/snakemake/executors/__init__.py", line 2293, in run_wrapper
  File "/bigdata/seymourlab/mrese001/grenepipe-0.12.0/rules/mapping-bwa-mem.smk", line 58, in __rule_map_reads
  File "/rhome/mrese001/.conda/envs/grenepipe/lib/python3.7/site-packages/snakemake/executors/__init__.py", line 568, in _callback
  File "/rhome/mrese001/.conda/envs/grenepipe/lib/python3.7/concurrent/futures/thread.py", line 57, in run
  File "/rhome/mrese001/.conda/envs/grenepipe/lib/python3.7/site-packages/snakemake/executors/__init__.py", line 554, in cached_or_run
  File "/rhome/mrese001/.conda/envs/grenepipe/lib/python3.7/site-packages/snakemake/executors/__init__.py", line 2359, in run_wrapper
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /bigdata/seymourlab/mrese001/grenepipe-0.12.0/my_citrus_analysis2/.snakemake/log/2023-05-10T113431.107190.snakemake.log
```
Here are the details of the error message inside log/bwa:

```
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 274716 sequences (40000193 bp)...
[W::sam_hdr_create] Ignored @SQ SN:scaffold_2 : bad or missing LN tag
[E::sam_hrecs_update_hashes] Header includes @SQ line "scaffold_2" with no LN: tag
[E::sam_hrecs_update_hashes] Header includes @SQ line "scaffold_2" with no LN: tag
samtools sort: failed to change sort order header to 'SO:coordinate'
```
Many thanks as always, Mariano
Hi Mariano,
ha! Progress! Going from error to error until it's fixed :-)
So, this is a new problem that I have not seen before. It's happening in an internal step that I did not even code myself :-D It might be a bit tricky to figure out where this is coming from. Let's see: are you using `--use-conda` when calling snakemake? It could be that a wrong version of samtools is being used by accident. Can you call `samtools --version` in your terminal and let me know what that gives you?

If neither of those helps or fixes it: How large is your dataset? If you could send me the reference genome and one sample for which the error occurs, along with the config file that you are using, I can try to re-create the issue for further debugging.
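As a side note, one way to probe the "no LN: tag" message from your bwa/samtools log is to look at the reference's `.fai` index: each line of a healthy index has five tab-separated columns (name, length, offset, linebases, linewidth), and the length in column 2 is what ends up as the `LN:` tag. A sketch, with a fabricated index file:

```shell
# Fabricated .fai: one well-formed entry, one broken entry without a length.
printf 'scaffold_1\t1500000\t12\t80\t81\n' >  ref.fa.fai
printf 'scaffold_2\n'                      >> ref.fa.fai

# Flag any sequence whose length column is missing or non-numeric:
awk -F'\t' 'NF < 2 || $2 !~ /^[0-9]+$/ {print $1" is missing its length"}' ref.fa.fai
```

If a real index looks like this, deleting the `.fai` and regenerating it (for example with `samtools faidx ref.fa`, or by letting grenepipe rebuild it) should fix the header.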
Cheers and so long Lucas
Hi Lucas,
Thanks for your response - you are correct, this error was generated with my own dataset. In previous runs of the workflow I always added --use-conda, but I was advised by our cluster admin to omit it. So I have been running the workflow like this: snakemake --cores 4 --verbose --directory

When I do add --use-conda, I get this error:

```
CreateCondaEnvironmentException:
Could not create conda environment from /bigdata/seymourlab/mrese001/grenepipe-0.12.0/rules/../envs/picard.yaml:
Collecting package metadata (repodata.json): ...working...
  File "/rhome/mrese001/.conda/envs/grenepipe/lib/python3.7/site-packages/snakemake/deployment/conda.py", line 389, in create
```
I am currently module loading samtools 1.17. As for the index files: I do have .fai files in the same directory that the .yaml file is pulling the references from. Do I need to do anything more apart from keeping them in the same directory? Apologies if I missed this in the wiki. Since I'm running this on a MacBook, am I looking for the Activity Monitor GUI? Or did you mean within the terminal? I am running this through a cluster environment.
I will send you the files you've asked for via email. Thanks again!
Mariano
Hi Mariano,
Thanks for your response, you are correct this error was generated with my own dataset
Ah okay, does that mean you managed to run the example now?
In previous runs of the workflow I always added --use-conda but was advised by our cluster admin to omit this.
Well, in that case, you'd have to set up every tool that grenepipe uses on your own, ensuring that all of them are in the correct versions that we need... That seems like a rather large overhead, and I don't understand why your cluster admin advises doing that. Using conda is a bit more complicated on a cluster, because you will likely have to set it up locally on your own (as explained in the grenepipe wiki) - but it is likely still far easier than trying to get everything to run with the cluster module system. Up to you though.

However, in that case, I won't be able to help you very much, as that would entail a lot of debugging on the cluster, I assume... There is a reason that grenepipe uses conda: dependencies are difficult, and conda makes that at least a bit easier (once you get past the initial trouble of getting conda to work). So, without conda, you'd have to figure this out with your cluster admin on your own. I'd hence highly advise against this, and instead try to get conda/mamba to work.
So I have been running the workflow like this: snakemake --cores 4 --verbose --directory
I'd recommend using mamba for the package management, by adding `--conda-frontend mamba` to the call. That requires that you have mamba set up, as also explained in the grenepipe wiki.
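Putting the flags together, the call could look like this - a sketch only: the directory path is a placeholder, and the snippet falls back to plain conda if mamba is not installed:

```shell
# Prefer mamba as the conda frontend when it is available (assumption:
# either mamba or conda is on PATH; /path/to/analysis is a placeholder).
command -v mamba >/dev/null 2>&1 && FRONTEND="mamba" || FRONTEND="conda"
echo "snakemake --cores 4 --use-conda --conda-frontend ${FRONTEND} --directory /path/to/analysis"
```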
When I do add --use-conda I get this error: `CreateCondaEnvironmentException`
Yes, that has happened to me with conda before, for that particular environment. It was hopefully solved in grenepipe v0.12.0 though. Are you using that or a later version? If not, please download a more recent version of grenepipe, v0.12.0 or later! If it still does not work, that would be strange, but could happen - let me know. Also, using mamba has solved this issue before, so if you are going to use mamba anyway, I guess it will be solved as well.
I am currently module loading samtools 1.17.
That's different from what grenepipe uses, but might still work. You'd need to test whether this is the cause of the issue. If so, it will also be solved when you use conda/mamba to get the versions that grenepipe uses internally.
As for the index files ... I do have .fai files in the same directory that the .yaml file is pulling the references from. Do I need to do anything more apart from keeping them in the same directory?
Nope, that should be it. I'd recommend letting grenepipe create them though, in case they were created before with some incompatible tool. So, to not delete any of your existing files, you could copy the ref genome fasta file (and just that one) to a new directory, and use that in the `config.yaml` instead. grenepipe will then create all the index files it needs automatically in that directory as well.
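As a sketch of that copy step (file and directory names are placeholders):

```shell
# Give grenepipe a clean directory containing only the fasta, so it builds
# fresh, mutually consistent index files itself; no stale .fai/.dict around.
mkdir -p fresh-ref
printf '>scaffold_1\nACGTACGT\n' > genome.fa   # stand-in for your real genome
cp genome.fa fresh-ref/
ls fresh-ref/                                  # should list only genome.fa
# then point the reference entry in config.yaml at fresh-ref/genome.fa
```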
Since I'm running this on a MacBook I'm looking for the Activity Monitor GUI correct? Or did you mean within the terminal? I am running this through a cluster environment.
I am not sure that I understand. You are running grenepipe on a MacBook, but that is within a cluster environment? So, your cluster nodes are Macs? With the command that you gave above (`snakemake --cores 4 --verbose --directory ...`), it does not look like this is a cluster - you are missing the `--profile`. Please see the cluster page for details on that.
If you intend to run grenepipe locally on your own MacBook, you can use conda/mamba more easily, but it won't scale to large datasets. If you are using a cluster, I'd be surprised to see that you have a cluster of Macs. But in that case, you'd typically want to follow the wiki on how to use `--profile`, or at least run grenepipe in a tmux or screen session. Or did you mean to say that you use your Mac to connect to the cluster, and run things there? Again, in that case, you want to use `--profile` to actually make use of the cluster (otherwise, you'd be running the whole pipeline on the login node), as well as follow the other steps as explained in the cluster wiki. Please clarify ;-)
Either way, as for the system monitor: that needs to be run on the computer that is actually executing the step that is causing trouble. If this is locally on your Mac, then yes, the Activity Monitor GUI seems to be the right tool. If you are running this on a cluster, it's more difficult, as you usually have to `ssh` into the node where things are running, which you'd need to figure out while the job is running. I recommend asking your cluster admin for help with that. Alternatively, you can use the cluster log files to debug afterwards (once the crash has happened), as explained in the cluster wiki as well - there is a Troubleshooting example that specifically addresses the out-of-memory problem. Following this should give you an idea where the error is coming from.
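As a sketch of that after-the-fact log check (the log directory, file name, and its content are all fabricated here, just to show the kind of pattern one would search for):

```shell
# Fabricated slurm log containing a typical out-of-memory kill message:
mkdir -p slurm-logs
printf 'slurmstepd: error: Detected 1 oom-kill event(s) in step 42.0\n' > slurm-logs/slurm-42.out

# List any log file mentioning common out-of-memory phrases (GNU grep):
grep -ril 'oom-kill\|Out Of Memory\|Exceeded job memory limit' slurm-logs/
```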
Hope that helps, so long Lucas
Hi Lucas!
Yes, I was indeed able to run the Arabidopsis example, and since the last time we spoke I solved the above issue with my cluster admin and put some citrus genomes through grenepipe! However, now that I am running paired-end reads, I am receiving a different error. But before I get too far ahead of myself, let me clear up some points discussed before: I am now using --use-conda and --conda-frontend mamba in my troubleshooting, and am also using grenepipe v0.12.0. I did indeed mean to say that I use my Mac to connect to the cluster and run things there - sorry for being unclear.
Thank you for your suggestions, and for pointing out that in order to use a bash job submission script I need a --profile. I did not have this at all, as you caught, and this might be why I'm running into errors.
At the moment I am troubleshooting with tmux in an interactive node per your suggestions, while I also troubleshoot the bash script, now with a profile appended. Will I need to specify a --profile when running snakemake interactively?
When I ran it this time, I opened a tmux session and got into an interactive cluster session using srun:

```
srun -p seymourlab --mem 50G -N 1 -n 16 --time 10-00:00:00 --pty bash -l
```

then I conda activate grenepipe and module load these tools:

```
module load samtools
module load trimmomatic
module load fastqc
module load bwa
module load picard
module load gatk/4.3.0.0
module load bcftools
module load seqkit
```

and lastly I run:

```
snakemake --cores 16 --conda-frontend mamba --verbose --directory /rhome/mrese001/bigdata/GRENEPIPE/parents_draft
```
I have my entire error output here:
Hi @mrese001,
thanks for the clarifications!
Will I need to specify a --profile when running the snakemake interactively?
That depends on what you want to achieve here. With `--profile`, snakemake will submit steps of the grenepipe analysis as jobs to the cluster. Without it, it will run "locally", which in your case of an interactive session means running on the node where that session lives. If you just want to test the general setup with a small example or test dataset, the latter is a good way to do so. However, for big runs, where you actually want to leverage the power of the cluster, you of course want to submit steps as jobs, so that more compute nodes can parallelize the work.
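For orientation, a minimal slurm profile could look roughly like this - a sketch only, following common snakemake profile conventions rather than anything grenepipe-specific; the file location, profile name, and partition are all hypothetical, so adapt it to your cluster and the grenepipe cluster wiki:

```yaml
# ~/.config/snakemake/myslurm/config.yaml  (hypothetical location and name)
jobs: 100                 # max cluster jobs snakemake keeps queued at once
use-conda: true
conda-frontend: mamba
latency-wait: 60          # tolerate filesystem lag between nodes
cluster: "sbatch --partition=seymourlab --mem={resources.mem_mb} --cpus-per-task={threads}"
```

It would then be used as `snakemake --profile myslurm ...` instead of passing those flags individually.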
As for the issues you are observing now: it is unclear from what you posted what exactly is happening. It seems you are not using `--profile` here, meaning that there are no slurm log files to check. It could be that your interactive session simply does not have enough memory for certain jobs. I recommend checking the troubleshooting page for more info on that. However, as you are running "locally" on your interactive node, you will have to check via some other means whether there is enough memory, for example by opening a second terminal on that exact same node and running `htop` there. Please also ask your admins for support. Alternatively, run it again with a slurm profile, and check the slurm error and out files for out-of-memory errors, as described in the link above.
Furthermore, another potential source of errors is your `module load` commands prior to running grenepipe. As described here, this can be difficult: those modules might overwrite the paths set by conda, leading to conflicting versions, which can cause errors like the ones shown above. Is there a reason you are doing that? Try leaving that out, and just rely on conda/mamba for the package management.
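To illustrate why module loading can shadow the conda environment: `module load` typically prepends the module's bin directory to PATH, and the first matching entry wins. The PATH value below is fabricated purely to show the effect:

```shell
# EXAMPLE_PATH is a fabricated PATH: a module's samtools directory sits
# in front of the conda environment, so the module's binary would be used.
EXAMPLE_PATH="/opt/modules/samtools-1.17/bin:/home/user/.conda/envs/grenepipe/bin:/usr/bin"

# Show the lookup order; the entry on line 1 shadows everything after it:
echo "$EXAMPLE_PATH" | tr ':' '\n' | grep -n 'samtools\|conda'
```

On the real system, `type -a samtools` lists every visible copy in lookup order, and running `module purge` before activating the conda environment avoids the shadowing entirely.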
Cheers Lucas
Hi @mrese001,
any updates? If things are working now, shall we close this issue?
Cheers Lucas
Hey @lczech !
Thank you for checking in. I have yet to figure it out, and will need time to establish a consistently functional pipeline. I would like to use the cluster resources, and need to read up on the proper way to set up a profile. Since I am under some time constraints, I had to put a pin in grenepipe for now, but I hope to return to it later. Thank you for your continued help and support!
Mariano
Okay, take your time :-)
Let's close this issue for now then, as I feel we have answered its original question. If you encounter more issues down the line, feel free to re-open or start a new one!
Cheers Lucas
Hello Lucas,
I am a bioinformatics beginner, so I apologize in advance if my question is a trivial one. I am getting this error message when I try to run the pipeline:
```
Building DAG of jobs...
MissingRuleException:
No rule to produce example (if you use input functions make sure that they don't raise unexpected exceptions).
```
What does this mean exactly and how can I possibly fix it?
Thank you! Mariano