koszullab / GRAAL

(Check out instaGRAAL for a faster, updated program!) This program is from Marie-Nelly et al., Nature Communications, 2014 ("High-quality genome assembly using chromosomal contact data"); see also Marie-Nelly, 2013, PhD thesis (https://www.theses.fr/2013PA066714)
https://research.pasteur.fr/fr/software/graal-software-for-genome-assembly-from-chromosome-contact-frequencies/

AttributeError: 'int' object has no attribute 'astype' #14

Open bgbrink opened 6 years ago

bgbrink commented 6 years ago

When I run GRAAL, I receive the following error:

Processing...
Description: convert dense file to COO sparse data.
Done.
start filtering
nfrags =  [95581]
n init frags =  [95581]
mean sparsity =  0.0021264316
std sparsity =  0.0022989216
max_sparsity =  0.059509736
cleaning : start
number of fragments to remove =  0
Exception in thread Thread-2:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "main_window.py", line 85, in run
    pyramid = pyr.build_and_filter(self.base_folder, self.size_pyramid, self.factor)
  File "/home/benedikt/Python/GRAAL/pyramid_sparse.py", line 69, in build_and_filter
    current_abs_fragments_contacts, pyramid_0)
  File "/home/benedikt/Python/GRAAL/pyramid_sparse.py", line 756, in remove_problematic_fragments
    p.render(pt ,'step %s\nProcessing...\nDescription: removing bad fragments.' % step)
  File "/home/benedikt/Python/GRAAL/progressbar.py", line 61, in render
    self.progress = (bar_width * percent.astype(np.int)) / 100
AttributeError: 'int' object has no attribute 'astype'

The input data has been generated using the HiC-Box (thanks again for your help there).

Most likely unrelated: the stdout is spammed with this message as well:

*** BUG ***
In pixman_region32_init_rect: Invalid rectangle passed
Set a breakpoint on '_pixman_log_error' to debug
baudrly commented 6 years ago

For some reason, 'percent' (which was a NumPy int32 object until a few commits ago) is now a plain Python int. Could you try replacing the offending line with:

self.progress = (bar_width * np.int32(percent)) / 100

and let me know if it fixes the issue? Thanks.
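The difference can be sketched like this (the values below are made up for illustration):

```python
import numpy as np

# A plain Python int has no .astype method, while a NumPy scalar does.
bar_width = 20
percent = 92  # now a plain int; it used to be a NumPy int32

# The original line would raise AttributeError:
#   progress = (bar_width * percent.astype(np.int)) / 100

# Wrapping the value with np.int32 works either way:
progress = (bar_width * np.int32(percent)) // 100
print(progress)  # 18
```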

bgbrink commented 6 years ago

Yup that fixed it, thanks. But now I have a new error:

Processing...
Description: removing bad fragments.
max new id =  95582
update contacts files...
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "main_window.py", line 85, in run
    pyramid = pyr.build_and_filter(self.base_folder, self.size_pyramid, self.factor)
  File "/home/benedikt/Python/GRAAL/pyramid_sparse.py", line 69, in build_and_filter
    current_abs_fragments_contacts, pyramid_0)
  File "/home/benedikt/Python/GRAAL/pyramid_sparse.py", line 818, in remove_problematic_fragments
    new_abs_id_frag_a = old_2_new_frags[fa + 1] # old_2_new_frags 1-based index
KeyError: -2

Also, could you explain somewhere what the parameter "Size of the pyramid" does? You only say which value to use for the respective examples, but you don't give a reason. Does this parameter depend on the genome size?

baudrly commented 6 years ago

Yup that fixed it, thanks. But now I have a new error:

Could you post a couple lines from your abs_fragments_contacts_weighted.txt file? It should really just be a sparse matrix in edge list format, with the first two columns representing the source and target nodes (fragments) and the third one representing the edges (number of contacts). Is there anything out of the ordinary with that file?
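For reference, a file in that format can be loaded into a sparse matrix roughly like this (the small edge list below is made up; the real file also has a header line):

```python
import numpy as np
import scipy.sparse as sp

# Made-up edge list in the same three-column format:
# id_frag_a, id_frag_b, n_contact
edges = np.array([
    [0, 1, 19],
    [0, 2, 2],
    [1, 2, 5],
])

nfrags = int(edges[:, :2].max()) + 1
contacts = sp.coo_matrix(
    (edges[:, 2], (edges[:, 0], edges[:, 1])),
    shape=(nfrags, nfrags),
)
print(contacts.toarray())
```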

Also, could you explain somewhere what the parameter "Size of the pyramid" does?

It sets the size of your bins. The contact map is recursively sum-pooled for every level in the pyramid. The bigger the size, the larger the size of the bins GRAAL is going to work with.
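A minimal sketch of that sum-pooling idea, with a made-up 4x4 map and pooling factor 2 (not GRAAL's actual implementation):

```python
import numpy as np

def sum_pool(mat, factor=2):
    """Sum-pool a square contact map by `factor` along both axes."""
    n = mat.shape[0] // factor * factor  # drop any ragged edge
    m = mat[:n, :n]
    return m.reshape(n // factor, factor, n // factor, factor).sum(axis=(1, 3))

level0 = np.arange(16).reshape(4, 4)  # made-up 4x4 contact map
level1 = sum_pool(level0)             # 2x2 map with larger bins
level2 = sum_pool(level1)             # 1x1 map: total contact count
```

Each level of the pyramid is then a coarser, sum-pooled view of the level below it.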

Does this parameter depend on the genome size?

Yes. You don't want the matrix to be too large, as it may take too long to converge (not to mention memory issues with your graphics card), but you don't want it to be too small either, as that will limit the possible operations on your genome (and opportunities for correction). From experience, on a GeForce GTX TITAN Z, I've found maps ranging from 1000 to 10000 bins to give pretty good results. It can be hard to gauge the right level at first, since the size of the bins depends on the restriction map of the genome, but when in doubt I'd start with the highest level first and climb down as needed.

bgbrink commented 6 years ago

Thanks for the explanation, very helpful! You should consider adding it to the readme.md.

I had a quick look at the abs_fragments_contacts_weighted.txt and it looks fine to me:

id_frag_a   id_frag_b   n_contact
0   1   19
0   2   2
0   5   1
0   10  2
0   23  2
0   24  1
0   105 1
0   113 1
0   155 1
baudrly commented 6 years ago

I'm not sure what's causing the KeyError. Do you have any negative numbers hanging around in your file? If you don't, could you send me the following:

so I can test and see what's wrong?

bgbrink commented 6 years ago

I just ran grep -oP -- '-\d+' abs_fragments_contacts_weighted.txt and got no results, so there shouldn't be any negative values. You can download the files here: https://www.dropbox.com/s/rwnil9as8h8v2v6/fragment_files.tar.gz?dl=1

baudrly commented 6 years ago

Alright, it runs fine on my machine, which may suggest I'm using a more up-to-date version. I added a branch called 'develop'; could you try running it on your dataset?

bgbrink commented 6 years ago

Edit: I realized I never sent you the genome.fasta, which I had previously been selecting as the Fasta file. But since you don't have it, it can't be necessary, so I tried running without it; see below. Could you clarify what should be selected under "Load Fasta File"?

bgbrink commented 6 years ago

I looked into this a bit more this morning. I ran everything again from scratch without loading a Fasta file, and this is what I got.

Master branch: computation runs fine up until here:

here we go
92.65692814897916% ▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣□□ step 9000000
Processing...
Description: convert dense file to COO sparse data.
Done.
Start filling the pyramid
here we go
92.65692814897916% ▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣□□ step 9000000
Processing...
Description: loading sparse data into hdf5.
Done.
pyramid built.
here we go
92.65692814897916% ▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣□□ step 9000000
Processing...
Description: convert dense file to COO sparse data.
Done.
start filtering
nfrags =  [95581]
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "main_window.py", line 85, in run
    pyramid = pyr.build_and_filter(self.base_folder, self.size_pyramid, self.factor)
  File "/home/benedikt/Python/GRAAL/pyramid_sparse.py", line 69, in build_and_filter
    current_abs_fragments_contacts, pyramid_0)
  File "/home/benedikt/Python/GRAAL/pyramid_sparse.py", line 585, in remove_problematic_fragments
    sparse_mat_csr = sp.csr_matrix((np_2_scipy_sparse[2,:], np_2_scipy_sparse[0:2,:]), shape=(nfrags, nfrags))
  File "/usr/local/lib/python2.7/dist-packages/scipy/sparse/compressed.py", line 51, in __init__
    other = self.__class__(coo_matrix(arg1, shape=shape))
  File "/usr/local/lib/python2.7/dist-packages/scipy/sparse/coo.py", line 191, in __init__
    self._check()
  File "/usr/local/lib/python2.7/dist-packages/scipy/sparse/coo.py", line 241, in _check
    raise ValueError('negative row index found')
ValueError: negative row index found
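For what it's worth, the error itself is easy to reproduce with a toy example: scipy rejects any negative index when assembling a sparse matrix (here -2 stands in for a fragment id that slipped through the old-to-new remapping):

```python
import numpy as np
import scipy.sparse as sp

# Toy reproduction of the ValueError above; these arrays are made up.
rows = np.array([0, -2])  # a negative row index triggers the error
cols = np.array([1, 3])
data = np.array([7, 7])
try:
    sp.csr_matrix((data, (rows, cols)), shape=(4, 4))
except ValueError as err:
    print(err)
```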

Develop branch: I had to make the following changes to get this version to run:

  1. File "main_window.py", line 9, changed wxversion.select("2.8") to wxversion.select("3.0")
  2. File "pyramid_sparse.py", changed all occurrences of p = ProgressBar('green', width=20, block='▣', empty='□') to p = ProgressBar('green', width=20, block='|', empty='-') in order to avoid SyntaxError: Non-ASCII character
  3. Files "simulation_loader.py" and "cuda_lib_gl.py", changed import Image to from PIL import Image

However, the end result is exactly the same: computation runs fine until the same ValueError.

baudrly commented 6 years ago

Hello, sorry for the delay. The fasta file should be the reference genome you mapped the reads onto. When I said 'it runs fine', I meant it successfully computes the entire pyramid and stores it in memory. You still have to load the reference genome afterwards, though, since it will be used to generate a new fasta file from the 'building blocks' being swapped, flipped, merged, etc.