ultra installation and run error

unique379r commented 2 years ago

Hi. After three failed installation attempts with the INSTALLATION script, I finally found the conda uLTRA package and installed it correctly. Next, I ran the following command on the test data but got the error below:

reads="reads/alz.polished.hq.fasta.gz"
genome="GRCh38.v33p13.primary_assembly.fa"
gtfexons="gencode.v33p13.primary_assembly.annotation.exon.gtf"
ultra="/scratch/rupesh/Apps/envs/ultra/bin/uLTRA"
output_dir="HIFI/test_results"

echo "Running ultra full pipeline without creating index or alignment.."

$ultra pipeline --isoseq --t 1 $genome $gtfexons $reads $output_dir

ERRORS:

 Traceback (most recent call last):
  File "/scratch/rupesh/Apps/envs/ultra/bin/uLTRA", line 722, in <module>
    align_reads(args)
  File "/scratch/rupesh/Apps/envs/ultra/bin/uLTRA", line 350, in align_reads
    nr_reads_to_ignore, path_reads_to_align = prefilter_genomic_reads.main(ref_part_sequences, args.ref, args.reads, args.outfolder, index_folder, args.nr_cores, args.genomic_frac, args.mm2_ksize)
  File "/scratch/rupesh/Apps/envs/ultra/lib/python3.9/site-packages/modules/prefilter_genomic_reads.py", line 143, in main
    path_reads_to_align = print_read_categories(reads_unindexed, reads_indexed, reads, outfolder, SAM_file)
  File "/scratch/rupesh/Apps/envs/ultra/lib/python3.9/site-packages/modules/prefilter_genomic_reads.py", line 120, in print_read_categories
    for acc, (seq, _) in help_functions.readfq(open(reads,"r")):
  File "/scratch/rupesh/Apps/envs/ultra/lib/python3.9/site-packages/modules/help_functions.py", line 89, in readfq
    for l in fp: # search for the start of the next record
  File "/scratch/rupesh/Apps/envs/ultra/lib/python3.9/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
**UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte**

Progress log:

Running ultra full pipeline without creating index or alignment..
total_flanks2: 467112
total_flank_size 527488947
total_unique_segment_counter 145303455
total_segments_bad 83027565
bad 1233041
total parts size: 146506923
total exons size: 359149696
min_intron: 1
Number of ref seqs in gff: 323106
Number of ref seqs in fasta: 194
Warning: Detected 147 sequences in reference fasta that are not in annotation:

KI270466.1 with length:1233
KI270706.1 with length:175055
KI270382.1 with length:4215

..
...

ACTGCAGTGGCGCAATCTCG 200
CAACCTCTGCCTCCCTGGTT 200
223114 223114 out of 467112 sequences has been modified in masking step.
Filtering reads aligned to unindexed regions with minimap2

Please help.

ksahlin commented 2 years ago

Hi @unique379r,

Glad you got it installed through conda! The fasta/q parser is complaining. This is a well-tested function, so my bet is that there is something wrong with the fasta file.

Perhaps you can manually check the first lines with

zcat reads/alz.polished.hq.fasta.gz | head -n 10

Does everything look as expected for fasta format?

The problem is a byte that cannot be decoded as UTF-8, as seen in the line

**UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte**

It seems to be at the very beginning of the file.

A second idea is to try uLTRA with the unzipped reads file, reads/alz.polished.hq.fasta.

Let me know how it goes.

ksahlin commented 2 years ago

On second thought, I don't think the readfq parser handles .gz files, so I would try idea 2 first. It is probably what is causing the error.
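
For context on the error above: a gzip file always starts with the two bytes 0x1f 0x8b, so opening it in text mode fails on 0x8b at position 1, which matches the traceback. The following is a minimal sketch, not part of uLTRA, of how one might detect a gzipped reads file and write a plain-text copy before running the pipeline. The function names `is_gzipped` and `decompress_if_gzipped` are hypothetical helpers introduced here for illustration.

```python
import gzip
import shutil
from pathlib import Path

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def is_gzipped(path):
    """Return True if the file starts with the gzip magic number."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


def decompress_if_gzipped(path):
    """Return a path to a plain (uncompressed) version of `path`.

    Hypothetical helper: gzipped input is decompressed next to the
    original file; anything else is returned unchanged.
    """
    path = Path(path)
    if not is_gzipped(path):
        return path
    if path.suffix == ".gz":
        plain_path = path.with_suffix("")  # reads.fasta.gz -> reads.fasta
    else:
        # Avoid overwriting the input when it has no .gz suffix
        plain_path = path.with_name(path.name + ".plain")
    with gzip.open(path, "rb") as src, open(plain_path, "wb") as dst:
        shutil.copyfileobj(src, dst)  # stream, so large read sets fit in memory
    return plain_path
```

With the file from this thread, `decompress_if_gzipped("reads/alz.polished.hq.fasta.gz")` would write `reads/alz.polished.hq.fasta`, which can then be passed to uLTRA as the reads argument.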

unique379r commented 2 years ago

Hey Kristoffer, you are correct: your fq parser was not able to deal with the gz file, so I tried a plain fasta. However, I then got an error at the 'mummer' step.

Please take a look at the error log:

Traceback (most recent call last):
  File "/scratch/rupesh/Apps/envs/ultra/lib/python3.9/site-packages/modules/mem_wrapper.py", line 31, in find_mems_slamem
    subprocess.check_call([ 'slaMEM', '-l' , str(min_mem),  refs_path, read_path, '-o', out_path ], stdout=stdout_file, stderr=stderr_file)
  File "/scratch/rupesh/Apps/envs/ultra/lib/python3.9/subprocess.py", line 368, in check_call
    retcode = call(*popenargs, **kwargs)
  File "/scratch/rupesh/Apps/envs/ultra/lib/python3.9/subprocess.py", line 349, in call
    with Popen(*popenargs, **kwargs) as p:
  File "/scratch/rupesh/Apps/envs/ultra/lib/python3.9/subprocess.py", line 951, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/scratch/rupesh/Apps/envs/ultra/lib/python3.9/subprocess.py", line 1821, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'slaMEM'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/scratch/rupesh/Apps/envs/ultra/bin/uLTRA", line 722, in <module>
    align_reads(args)
  File "/scratch/rupesh/Apps/envs/ultra/bin/uLTRA", line 395, in align_reads
    mem_wrapper.find_mems_slamem(args.outfolder, args.reads_tmp, ref_path, mummer_out_path, args.min_mem)
  File "/scratch/rupesh/Apps/envs/ultra/lib/python3.9/site-packages/modules/mem_wrapper.py", line 34, in find_mems_slamem
    find_mems_mummer(outfolder, read_path, refs_path, out_path, min_mem)
  File "/scratch/rupesh/Apps/envs/ultra/lib/python3.9/site-packages/modules/mem_wrapper.py", line 16, in find_mems_mummer
    subprocess.check_call([ 'mummer',   '-maxmatch', '-l' , str(min_mem),  refs_path, read_path], stdout=output_file, stderr=null)
  File "/scratch/rupesh/Apps/envs/ultra/lib/python3.9/subprocess.py", line 368, in check_call
    retcode = call(*popenargs, **kwargs)
  File "/scratch/rupesh/Apps/envs/ultra/lib/python3.9/subprocess.py", line 349, in call
    with Popen(*popenargs, **kwargs) as p:
  File "/scratch/rupesh/Apps/envs/ultra/lib/python3.9/subprocess.py", line 951, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/scratch/rupesh/Apps/envs/ultra/lib/python3.9/subprocess.py", line 1821, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'mummer'

ksahlin commented 2 years ago

The conda installation should install slaMEM for you, but it seems slaMEM is not installed, or at least it is not found on your machine.

Did you run conda activate [name of your ultra env] after the conda installation?

Otherwise, you can manually install slaMEM easily by:

git clone git@github.com:fjdf/slaMEM.git
cd slaMEM
make

and then put the resulting binary file slaMEM in your path.

Best, /K

ksahlin commented 2 years ago

You don't need to worry about the mummer. It's a fall-back call if slaMEM returns an error. In this case the error was that slaMEM is not found. If you followed the bioconda installation it should have installed it for you. My bet is that you, perhaps, followed only some of the steps in the manual installation.
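
Both tracebacks above end in FileNotFoundError, which means the executable was not found on PATH rather than that it ran and failed. A quick pre-flight check along these lines may help confirm that the conda environment is active before starting a long run. This is only a sketch; `find_missing_tools` is a hypothetical helper, and the tool list is simply the executables named in this thread.

```python
import shutil


def find_missing_tools(tools):
    """Return the names in `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    # Executables mentioned in the log and tracebacks of this thread
    missing = find_missing_tools(["slaMEM", "mummer", "minimap2"])
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All tools found on PATH.")
```

If slaMEM is reported as missing, activating the conda environment or adding the directory containing the slaMEM binary to PATH should resolve it.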

unique379r commented 2 years ago

Hi there, I think I got it and it's running now. A few more questions:

1. Can uLTRA accept CCS bam (HiFi PacBio) reads as input? The help says it requires fasta/fastq, which I don't have, though I do have HQ fasta and fastq from isoseq3. Is that what uLTRA expects from isoseq? Or can bam2fastq of the CCS reads be used as input?
2. I am guessing the gtf input is exons only, not the full gencode gtf with genes, exons, etc., right?
3. After the mapping or pipeline output as bam, do you suggest going on to isoseq clustering?

Rupesh Kesharwani

ksahlin commented 2 years ago

Great that you got it running!

1. No, but HQ fastq from isoseq3 is fine, or alternatively, simply get the fastq reads from the bam as you write (e.g., with bam2fastq).
2. No, uLTRA assumes it is the "full" gtf with gene, transcript, and exon information (such as the gencode gtfs).
3. It depends on what you want to do after mapping, but in general no. isoseq clustering is usually a reference-free analysis step. You are mapping to a reference, and can therefore use reference-based software (depending on what you want to do). TAMA, FLAIR, SQANTI, and TALON come to mind. There are more tools though.

Best, K

ksahlin / ultra

ultra installation and run error #10