Open bgbrink opened 6 years ago
For some reason 'percent' (which used to be an np.int32 object a few commits ago) is now a plain Python int. Could you try replacing the offending line with:
self.progress = (bar_width * np.int32(percent)) / 100
and let me know if it fixes the issue? Thanks.
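For context, the cast matters because the progress computation relies on integer arithmetic; a minimal sketch with illustrative stand-in values (bar_width and percent here are not GRAAL's actual state):

```python
import numpy as np

bar_width = 20
percent = 93  # now a plain Python int in newer commits

# Casting back to np.int32 restores the integer semantics the
# progress bar expects (// used for explicitness under Python 3)
progress = (bar_width * np.int32(percent)) // 100
```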
Yup that fixed it, thanks. But now I have a new error:
Processing...
Description: removing bad fragments.
max new id = 95582
update contacts files...
Exception in thread Thread-1:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "main_window.py", line 85, in run
pyramid = pyr.build_and_filter(self.base_folder, self.size_pyramid, self.factor)
File "/home/benedikt/Python/GRAAL/pyramid_sparse.py", line 69, in build_and_filter
current_abs_fragments_contacts, pyramid_0)
File "/home/benedikt/Python/GRAAL/pyramid_sparse.py", line 818, in remove_problematic_fragments
new_abs_id_frag_a = old_2_new_frags[fa + 1] # old_2_new_frags 1-based index
KeyError: -2
Also, could you explain somewhere what the parameter "Size of the pyramid" does? You only say which one to use for the respective examples, but you don't give any reasoning. Does this parameter depend on the genome size?
> Yup that fixed it, thanks. But now I have a new error:
Could you post a couple lines from your abs_fragments_contacts_weighted.txt
file? It should really just be a sparse matrix in edge list format, with the first two columns representing the source and target nodes (fragments) and the third one representing the edges (number of contacts). Is there anything out of the ordinary with that file?
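The expected layout can be sketched like this, on a hypothetical mini edge list with the same three columns (source fragment, target fragment, number of contacts):

```python
import numpy as np
import scipy.sparse as sp

# Hypothetical three-column edge list, mirroring
# abs_fragments_contacts_weighted.txt without its header
edges = np.array([[0, 1, 19],
                  [0, 2, 2],
                  [1, 2, 5]])

# Build the sparse contact matrix in COO (coordinate) format
contacts = sp.coo_matrix((edges[:, 2], (edges[:, 0], edges[:, 1])),
                         shape=(3, 3))
```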
> Also, could you explain somewhere what the parameter "Size of the pyramid" does?
It sets the size of your bins. The contact map is recursively sum-pooled once for every level in the pyramid, so the bigger the pyramid, the larger the bins GRAAL is going to work with.
> Does this parameter depend on the genome size?
Yes. You don't want the matrix to be too large, as it may take too long to converge (not to mention memory issues on your graphics card), but you don't want it too small either, since that limits the possible operations on your genome (and opportunities for correction). From experience on a GeForce GTX TITAN Z, I've found maps ranging from 1000 to 10000 bins to give pretty good results. It can be hard to gauge the right level at first, since bin size depends on the restriction map of the genome, but when in doubt I'd start with the highest level and climb down as needed.
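The recursive sum-pooling described above can be sketched as follows; this is an illustrative toy, not GRAAL's actual pyramid code, and the function name and pooling factor are assumptions:

```python
import numpy as np

def sum_pool(contact_map, factor=2):
    """Sum-pool a square contact map by `factor` (illustrative sketch)."""
    n = contact_map.shape[0] // factor * factor  # drop any ragged edge bins
    trimmed = contact_map[:n, :n]
    return trimmed.reshape(n // factor, factor,
                           n // factor, factor).sum(axis=(1, 3))

# A 4x4 toy contact map pooled into 2x2 bins: each new bin is the sum of a
# 2x2 block, so total contacts are preserved at every level of the pyramid
level_0 = np.ones((4, 4), dtype=int)
level_1 = sum_pool(level_0)
```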
Thanks for the explanation, very helpful! You should consider adding it to the readme.md.
I had a quick look at the abs_fragments_contacts_weighted.txt
and it looks fine to me:
id_frag_a id_frag_b n_contact
0 1 19
0 2 2
0 5 1
0 10 2
0 23 2
0 24 1
0 105 1
0 113 1
0 155 1
I'm not sure what's causing the KeyError. Are there any negative numbers lurking in your file? If not, could you send me the following:
abs_fragments_contacts_weighted.txt
fragments_list.txt
info_contigs.txt
so I can test and see what's wrong?
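For context, a KeyError of -2 from a 1-based lookup like old_2_new_frags[fa + 1] implies fa was -3, i.e. a negative fragment index slipped in upstream. A minimal reproduction (the dictionary contents here are hypothetical):

```python
# Hypothetical 1-based remapping, mirroring old_2_new_frags
old_2_new_frags = {1: 0, 2: 1, 3: 2}

fa = -3  # a negative fragment id, as implied by KeyError: -2
try:
    old_2_new_frags[fa + 1]
except KeyError as exc:
    missing_key = exc.args[0]
```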
I just ran grep -oP -- '-\d+' abs_fragments_contacts_weighted.txt
and it returned nothing, so there shouldn't be any negative values. You can download the files here: https://www.dropbox.com/s/rwnil9as8h8v2v6/fragment_files.tar.gz?dl=1
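An equivalent check in Python, run here on a hypothetical in-memory sample rather than the real file:

```python
import io
import numpy as np

# Stand-in for abs_fragments_contacts_weighted.txt
sample = """id_frag_a id_frag_b n_contact
0 1 19
0 2 2
0 5 1
"""

# Skip the header, parse the three integer columns, and scan for negatives
data = np.loadtxt(io.StringIO(sample), skiprows=1, dtype=int)
has_negatives = bool((data < 0).any())  # False here, matching the grep result
```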
Alright, it runs fine on my machine, which may suggest I'm using a more up-to-date version. I added a branch called 'develop', could you try running it on your dataset?
Edit: I realized I never sent you the genome.fasta, which I had previously been selecting as the Fasta file. But since you don't have it, it can't be necessary, so I tried running without it; see below. Could you clarify what should be selected under "Load Fasta File"?
I looked into this a bit more this morning. I ran everything again from scratch without loading a Fasta file and this is what I got.
Master branch: computation runs fine up until here:
here we go
92.65692814897916% ▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣□□ step 9000000
Processing...
Description: convert dense file to COO sparse data.
Done.
Start filling the pyramid
here we go
92.65692814897916% ▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣□□ step 9000000
Processing...
Description: loading sparse data into hdf5.
Done.
pyramid built.
here we go
92.65692814897916% ▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣□□ step 9000000
Processing...
Description: convert dense file to COO sparse data.
Done.
start filtering
nfrags = [95581]
Exception in thread Thread-1:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "main_window.py", line 85, in run
pyramid = pyr.build_and_filter(self.base_folder, self.size_pyramid, self.factor)
File "/home/benedikt/Python/GRAAL/pyramid_sparse.py", line 69, in build_and_filter
current_abs_fragments_contacts, pyramid_0)
File "/home/benedikt/Python/GRAAL/pyramid_sparse.py", line 585, in remove_problematic_fragments
sparse_mat_csr = sp.csr_matrix((np_2_scipy_sparse[2,:], np_2_scipy_sparse[0:2,:]), shape=(nfrags, nfrags))
File "/usr/local/lib/python2.7/dist-packages/scipy/sparse/compressed.py", line 51, in __init__
other = self.__class__(coo_matrix(arg1, shape=shape))
File "/usr/local/lib/python2.7/dist-packages/scipy/sparse/coo.py", line 191, in __init__
self._check()
File "/usr/local/lib/python2.7/dist-packages/scipy/sparse/coo.py", line 241, in _check
raise ValueError('negative row index found')
ValueError: negative row index found
Develop branch: I had to make the following changes in order to get this version to run:
- wxversion.select("2.8") → wxversion.select("3.0")
- p = ProgressBar('green', width=20, block='▣', empty='□') → p = ProgressBar('green', width=20, block='|', empty='-'), to avoid SyntaxError: Non-ASCII character
- import Image → from PIL import Image
However, the end result is exactly the same: computation runs fine until the ValueError.
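For reference, that ValueError comes from scipy's index validation at matrix construction time, which rejects any negative row or column index. A minimal reproduction (toy values, not the real contact data):

```python
import numpy as np
import scipy.sparse as sp

# scipy validates COO indices when the matrix is built, so a single
# negative row index is enough to trigger the error from the traceback
rows = np.array([0, -1])
cols = np.array([0, 1])
vals = np.array([3, 7])
try:
    sp.csr_matrix((vals, (rows, cols)), shape=(2, 2))
    raised = False
except ValueError:
    raised = True
```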
Hello, sorry for the delay. The Fasta file should be the reference genome you mapped the reads onto. When I said 'it runs fine', I meant it successfully computes the entire pyramid and stores it in memory. You still have to load the reference genome afterwards, though, since it will be used to generate a new fasta file from the 'building blocks' being swapped, flipped, merged, etc.
When I run GRAAL, I receive the following error:
The input data has been generated using the HiC-Box (thanks again for your help there).
Most likely unrelated: the stdout is spammed with this message as well: