hoffmangroup / umap

24 stars 2 forks source link

Some questions with running the Umap process #19

Open Miracle-Yao opened 1 month ago

Miracle-Yao commented 1 month ago

Hi, @EricR86 @mehrankr

Thanks for providing an excellent program for calculating genome mappability. I'd like to inquire about a few questions I've had while running umap here.

  1. I'm using the latest version I just downloaded (umap-1.2.1 tag), It seems that umap-1.1.1-py2.7.egg is used during the installation and compilation process, so I'm not quite sure which version of umap I'm actually installing. I think clarity on the version was crucial because during the next implementation, I discovered some new issues.

  2. After I executed the ubismap.py script, I tried to get the file for the specified K (e.g. 50mer) using the get_kmers.py script. The script I use is as follows:

    #!/bin/bash
    for i in {0..2441}; do
    python /home/software/umap-1.2.1/umap/get_kmers.py \
    umap/susScr11/GenomeMappability/chrsize.tsv \
    /home/data/umap/k50 \
    /home/data/umap/GenomeMappability/chrs \
    /home/data/umap/GenomeMappability/chrsize_index.tsv -job_id $i
    done

    Although the job id is iterated from 0 to 2441, the generated file as of chrY.2440.50.kmer.gz does not produce chrY.2441.50.kmer.gz. The log file is as follows:

    Traceback (most recent call last):
    File "/home/software/umap-1.2.1/umap/get_kmers.py", line 262, in <module>
    chromsize_path, out_dir, kmer, job_id, chr_dir, idx_path = get_args()
    File "/home/software/umap-1.2.1/umap/get_kmers.py", line 61, in get_args
    job_id = int(os.environ[args.var_id]) - 1
    File "/home/miniconda3/envs/python2/lib/python2.7/UserDict.py", line 23, in __getitem__
    raise KeyError(key)
    KeyError: 'SGE_TASK_ID'
    Created all sequences for chr1:1-1000000
    Created all sequences for chr1:1000001-2000000
    ......
    Created all sequences for chrY:41000001-42000000
    Created all sequences for chrY:42000001-43000000

    I don't know why I get the previous KeyError: 'SGE_TASK_ID' error after running it. Also, the last record for chrY that you know of in the chrsize_index.tsv file looks like this 2441 chrY 43000001 43547829. In other words, the Created all sequences for chrY:43000001-43547829 step is missing. How to avoid the above KeyError: 'SGE_TASK_ID' and how to be able to generate the chrY.2441.50.kmer.gz file properly?

  3. Most of the sequencing data in ENCODE is single-ended, with read lengths ranging from 24, 36, 50 to 100, so these lengths were used in your previous work to generate mappability tracks. With the widespread use of pair-ended sequencing, especially the PE150 model, it would be useful to integrate the data from the previous SE50, SE100 models and the data from the existing PE100,PE150 model, how to choose the appropriate kmer? 50, 100, 200 and 300?

    We used read lengths of 24, 36, 50 and 100 bp to generate mappability tracks for unmodified and bisulfite-converted genomes of human (GRCh37 and GRCh38) and mouse (GRCm37 and GRCm38) (Retrieved from https://doi.org/10.1093/nar/gky677)

Thank you for your time and consideration. I am looking forward to your reply.

mehrankr commented 1 month ago

Hello @Miracle-Yao

Please open each item as a separate issue going forward so we can get to them in order.

I will let @EricR86 get back to your about 1.

Regarding 2., I think I know what's happening. Originally the script was developed for a version of SGE that only supported task IDs >= 1 Would you mind replacing all instances of job_id = int(os.environ[args.var_id]) - 1 with job_id = int(os.environ[args.var_id]) - 1

And also any where else where job ID is being subtracted by 1?

If not resolved, please provide a link to the chromosome size file and fasta file so we can look into this.

Regarding 3.; there is no straight answer since mappability of paired fragments varies by their length. I recommend considering the discussion in the manuscript and figure 7a to decide on the proper length. This could vary for each study as well.

EricR86 commented 1 month ago

@Miracle-Yao The tagged version is likely correct despite whatever version reported by pypi/python. I suspect the configured version in the code was simply not kept up to date. The tagged commit is simply what was effectively left as the last working state without much support for the time being.

Miracle-Yao commented 1 month ago

Hi, @EricR86 @mehrankr

Thank you for your quickly reply & useful suggestions.