Closed: dimatr closed this issue 4 years ago
When you're ready to merge, please pull in the latest from master. I'm still seeing conflicts:
matrixcomponent/JSONparser.py
matrixcomponent/PangenomeSchematic.py
matrixcomponent/matrix.py
segmentation.py
It seems like you need to update the tests, @dimatr?
(base) ubuntu@pantograph2:~/software/component_segmentation/git/dimatr_optimizations$ pytest matrixcomponent/tests.py
============================================================================================================ test session starts =============================================================================================================
platform linux -- Python 3.7.4, pytest-5.4.1, py-1.8.0, pluggy-0.13.0
rootdir: /mnt/vol1/software/component_segmentation/git/dimatr_optimizations
plugins: openfiles-0.4.0, remotedata-0.3.2, arraydiff-0.3, doctestplus-0.4.0
collected 5 items
matrixcomponent/tests.py ...FF [100%]
================================================================================================================== FAILURES ==================================================================================================================
______________________________________________________________________________________________________________ test_find_groups ______________________________________________________________________________________________________________
def test_find_groups():
data = np.array([
[1, 2], [1, 2], [1, 3],
[2, 1], [2, 1], [2, 1], [2, 2],
[3, 3], [3, 3], [3, 4], [3, 4], [3, 5]
])
assert np.array_equal(find_groups(data[:,0], data[:,1]),
[(0, 2), (2, 3),
(3, 6), (6, 7),
(7, 9), (9, 11), (11, 12)])
data = np.array([[]])
> assert np.array_equal(find_groups(data[:, 0], data[:, 1]), [])
E IndexError: index 0 is out of bounds for axis 1 with size 0
matrixcomponent/tests.py:56: IndexError
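Note that the IndexError here comes from the test input, not from find_groups itself: `np.array([[]])` has shape (1, 0), so `data[:, 0]` indexes an empty axis before find_groups is ever called. A minimal sketch of the fix, using a hypothetical stand-in for find_groups reconstructed from the expected values above (the real implementation may differ):

```python
import numpy as np

def find_groups(first, second):
    """Hypothetical stand-in for find_groups: return (start, end) index
    ranges of consecutive rows where the (first, second) pair is constant."""
    n = len(first)
    if n == 0:
        return []
    # positions where either column changes value between adjacent rows
    change = np.flatnonzero((np.diff(first) != 0) | (np.diff(second) != 0)) + 1
    bounds = np.concatenate(([0], change, [n]))
    return list(zip(bounds[:-1], bounds[1:]))

# np.array([[]]) has shape (1, 0): one row, zero columns, so data[:, 0]
# raises IndexError.  An empty input should instead be shaped (0, 2):
data = np.empty((0, 2), dtype=int)
assert find_groups(data[:, 0], data[:, 1]) == []
```

With a (0, 2)-shaped empty array, both column selections yield empty 1-D arrays and the empty-input assertion can pass.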
_______________________________________________________________________________________________________ test_sort_and_drop_duplicates ________________________________________________________________________________________________________
def test_sort_and_drop_duplicates():
df = dict({
"from": [1, 3, 2, 3, 2, 0, 5, 4, 1, 2],
"to": [2, 2, 4, 2, 3, 1, 4, 3, 2, 3],
"path_index": [0, 3, 1, 2, 2, 3, 2, 1, 3, 2],
})
expected = dict({
"from": [0, 1, 1, 2, 2, 3, 3, 4, 5],
"to": [1, 2, 2, 3, 4, 2, 2, 3, 4],
"path_index": [3, 0, 3, 2, 1, 2, 3, 1, 2],
}) # only one duplicate (2, 3, 2)
> assert np.array_equal(sort_and_drop_duplicates( [np.concatenate( (df['from'], df['to'], df['path_index']) )], expected) )
matrixcomponent/tests.py:75:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
connections = [array([1, 3, 2, 3, 2, 0, 5, 4, 1, 2, 2, 2, 4, 2, 3, 1, 4, 3, 2, 3, 0, 3,
1, 2, 2, 3, 2, 1, 3, 2])], shift = {'from': [0, 1, 1, 2, 2, 3, ...], 'path_index': [3, 0, 3, 2, 1, 2, ...], 'to': [1, 2, 2, 3, 4, 2, ...]}
path_shift = 10
def sort_and_drop_duplicates(connections: 'List[np.array]', shift=21, path_shift=10) -> dict:
'''
returns connections sorted by ["from", "to", "path_index"] without duplicate entries;
see find_dividers in segmentation.py
'''
> mask = (1 << shift) - 1
E TypeError: unsupported operand type(s) for <<: 'int' and 'dict'
matrixcomponent/utils.py:85: TypeError
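Two things are going on in this failure. The TypeError itself is a misplaced parenthesis in the assertion: the closing paren of np.array_equal lands after `expected`, so `expected` is passed into sort_and_drop_duplicates as its `shift` argument. Separately, the traceback hints at the underlying trick: `mask = (1 << shift) - 1` suggests each (from, to, path_index) triple is bit-packed into one integer so sorting and deduplication happen in a single np.unique pass. A hedged sketch of that idea (the real function takes `List[np.array]` and its exact packing layout is not shown, so this stand-in is an assumption):

```python
import numpy as np

def sort_and_drop_duplicates(frm, to, path_index, shift=21, path_shift=10):
    """Hypothetical stand-in: pack each (from, to, path_index) triple into a
    single int64 so that integer order matches the lexicographic order of the
    triples, then let np.unique sort and drop duplicates in one pass.
    Assumes to < 2**(shift - path_shift) and path_index < 2**path_shift."""
    packed = ((np.asarray(frm, dtype=np.int64) << shift)
              | (np.asarray(to, dtype=np.int64) << path_shift)
              | np.asarray(path_index, dtype=np.int64))
    packed = np.unique(packed)  # sorted, duplicates dropped
    return {
        "from": (packed >> shift).tolist(),
        "to": ((packed >> path_shift) & ((1 << (shift - path_shift)) - 1)).tolist(),
        "path_index": (packed & ((1 << path_shift) - 1)).tolist(),
    }
```

Independently of the implementation, the assertion needs its parenthesis moved so the comparison is `np.array_equal(sort_and_drop_duplicates(...), expected)` rather than feeding `expected` into the function.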
========================================================================================================== short test summary info ===========================================================================================================
FAILED matrixcomponent/tests.py::test_find_groups - IndexError: index 0 is out of bounds for axis 1 with size 0
FAILED matrixcomponent/tests.py::test_sort_and_drop_duplicates - TypeError: unsupported operand type(s) for <<: 'int' and 'dict'
======================================================================================================== 2 failed, 3 passed in 1.53s =========================================================================================================
I think we have to add psutil to requirements.txt.
With all the changes, the 6.2 GB Athaliana JSON is read and processed in <= 8 minutes, using up to 30 GB of runtime memory (test VM with 28 cores and 64 GB of RAM).
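For reference, a minimal sketch of how psutil could report the kind of runtime-memory figure quoted above (assuming psutil is used for this sort of measurement; the actual call sites in the code are not shown here):

```python
import os
import psutil

# Resident set size (RSS) of the current process, converted to GB.
process = psutil.Process(os.getpid())
rss_gb = process.memory_info().rss / 2**30
print(f"resident memory: {rss_gb:.2f} GB")
```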