suda-huanglab / circlehunter

A pipeline for identifying complex ecDNA using ATAC-Seq data.

Bugs when using circlehunter #6

Closed by lufuhao 1 year ago

lufuhao commented 1 year ago

Hi, Ms Huang,

It was good to learn about this program at the Bioinformatics Meeting in Kaifeng on 2023-07-08. I would like to try this excellent program, but I ran into several problems:

  1. Is it possible to generate a list of commands first and run them separately with bash? The pipeline has many dependencies, so whenever an error occurs we have to start over from the very beginning. I would strongly recommend printing all the commands to a file so that each one can be tested and run manually.

  2. circlefinder.smk, line 43, should be:

index=config['genome']['bwa_index']

  3. annotate.py does not support .csi indexes; it would be better to add support for long chromosomes. Many other genomes have extra-long chromosomes (>= 536 Mbp), which the usual .tbi index cannot handle, so a .csi index is required. When opening pysam.TabixFile, check which index exists and pass it through the 'index' option, for example:

db = pysam.TabixFile(db_path, index=db_path + ".tbi")

or

db = pysam.TabixFile(db_path, index=db_path + ".csi")

  4. In the configuration file, line 10:

    path to reference genome size file

    Is this a 2-column file or a regular BED file?
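For point 3, the index check could look roughly like the sketch below. It is only an illustration and not the current annotate.py code: pick_tabix_index and open_annotation_db are hypothetical helper names, the exists parameter is there only to make the check easy to test, and preferring .csi when both indexes are present is an assumption. The only pysam call used is pysam.TabixFile with its index argument.

```python
import os


def pick_tabix_index(db_path, exists=os.path.exists):
    """Return the path of an index file for db_path, preferring .csi over .tbi.

    `exists` is a hypothetical hook so the lookup can be tested without files.
    """
    for suffix in (".csi", ".tbi"):
        candidate = db_path + suffix
        if exists(candidate):
            return candidate
    raise FileNotFoundError(f"no .csi or .tbi index found for {db_path}")


def open_annotation_db(db_path):
    """Open a bgzip-compressed annotation file with whichever index exists."""
    import pysam  # imported here so pick_tabix_index works without pysam installed

    return pysam.TabixFile(db_path, index=pick_tabix_index(db_path))
```

With something like this, annotate.py could call open_annotation_db(db_path) in place of a bare pysam.TabixFile(db_path), and it would fail with a clear message when neither index is present.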

Thanks so much

yangmqglobe commented 1 year ago

Translated by ChatGPT from the original Chinese reply:

I'm sorry, but the problems you describe may fall outside the original aims of our project, so we cannot offer a better solution. Detailed notes on each point follow for your reference:

  1. With snakemake, you can obtain all the commands that will be executed. Two useful options are -p (print shell commands) and -n (dry run); see the snakemake documentation for details. However, this approach runs against the purpose of snakemake, which lets you start from any step and, after an error interrupts a run, automatically skips the completed steps and continues with the remaining ones. To do what you have in mind, you may need to read the snakemake documentation more closely. As for the dependencies you mention, I'm not sure whether you mean software dependencies or data dependencies. In our tests, all software dependencies can be installed with mamba as described in the README. If you mean file dependencies, you need to prepare the reference genome sequence and its index according to your project design; most of these can be downloaded from public databases.

  2. circlefinder is not a project we maintain; our project is called circlehunter. Any code related to circlefinder is therefore probably abandoned code left over from testing and is not a core part of our project.

  3. Annotation is not part of our main ecDNA identification workflow, so you can skip that step. There is a further issue, though: if a reference genome cannot be handled by .tbi, then .bai obviously cannot handle it either. Supporting these ultra-long genomes would therefore probably take more than changing the annotation step. That goes beyond the original aims of our project and is not something we have considered.

  4. This is a two-column table. You can download the size file for common genomes from UCSC and specify the path to the file here.
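To make points 3 and 4 concrete, here is a minimal sketch that reads a two-column size file (chromosome name, then length) and lists the chromosomes that are too long for .tbi and .bai indexes. It assumes whitespace-separated columns as in the UCSC chrom.sizes files and takes the limit to be 2^29 - 1 bp, which is the figure given in the SAM and tabix specifications as far as I know. The function names and the chrLong entry are hypothetical and only serve the example.

```python
# Largest coordinate the .tbi/.bai binning scheme can address (assumed 2^29 - 1).
TBI_BAI_MAX_LENGTH = 2 ** 29 - 1  # 536,870,911 bp


def read_chrom_sizes(lines):
    """Parse two-column (name, length) lines into a dict of name -> length."""
    sizes = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, length = line.split()[:2]
        sizes[name] = int(length)
    return sizes


def chroms_needing_csi(sizes):
    """Return the sorted names of chromosomes longer than the .tbi/.bai limit."""
    return sorted(name for name, length in sizes.items()
                  if length > TBI_BAI_MAX_LENGTH)


# Two human (hg38) chromosomes plus one made-up long chromosome.
example = ["chr1\t248956422", "chr2\t242193529", "chrLong\t600000000"]
print(chroms_needing_csi(read_chrom_sizes(example)))  # prints ['chrLong']
```

For a real file, read_chrom_sizes(open(path)) works the same way, since an open file yields its lines one by one.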

I apologize for any inconvenience caused, and I hope these explanations help. I'm not sure how your project is designed, but it is clearly different from ours. You are very welcome to adapt circlehunter to fit your project, and we look forward to seeing your work.

basilahmad01 commented 1 year ago

Hi, Ms Huang, ...

  1. Is it possible to generate a list of cmds first and use bash to run it separately? Otherwise, we have to run it from the very beginning. This pipeline needs a lot of dependencies. So we have to start over once an error happens. So I would strongly recommend to print all the cmds to a file, and we can manually test and run each cmd.

For the first question: you can use --printshellcmds to show all of the commands that are run. For example:

snakemake --printshellcmds --snakefile ../Snakefile -j 16 --configfile config.yaml
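If the goal is to save the planned commands to a file without running anything, the -p option can be combined with the dry-run option -n mentioned earlier in this thread. The sketch below wraps that call in Python. The function names and the output path are hypothetical, and the saved text is snakemake's dry-run report (job descriptions with the shell commands in between), not a ready-to-run bash script.

```python
import subprocess


def dry_run_command(snakefile, configfile):
    """Build a snakemake call that prints shell commands without running them."""
    return [
        "snakemake",
        "-n",              # dry run: plan the jobs, execute nothing
        "-p",              # print the shell command of each job
        "--cores", "1",    # harmless for a dry run; some versions expect it
        "--snakefile", snakefile,
        "--configfile", configfile,
    ]


def save_planned_commands(snakefile, configfile, out_path):
    """Write snakemake's dry-run report, shell commands included, to out_path."""
    result = subprocess.run(
        dry_run_command(snakefile, configfile),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # capture both streams in one report
        text=True,
        check=True,
    )
    with open(out_path, "w") as handle:
        handle.write(result.stdout)
```

For example, save_planned_commands("../Snakefile", "config.yaml", "planned_cmds.txt") would leave the report in planned_cmds.txt, from which individual commands can be copied and tested by hand.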

lufuhao commented 1 year ago

That would be quite helpful. Thanks so much