graph-genome / component_segmentation

Read in ODGI Bin output and identify co-linear components
Apache License 2.0

Optimizations #40

Closed dimatr closed 4 years ago

dimatr commented 4 years ago

With all the changes, the 6.2 GB Athaliana JSON is read and processed in at most 8 minutes, using up to 30 GB of memory at runtime (test VM with 28 cores and 64 GB RAM).

josiahseaman commented 4 years ago

When you're ready to merge, please pull in the latest from master. I'm still seeing conflicts:

matrixcomponent/JSONparser.py
matrixcomponent/PangenomeSchematic.py
matrixcomponent/matrix.py
segmentation.py

subwaystation commented 4 years ago

Seems like you need to update the tests, @dimatr?

(base) ubuntu@pantograph2:~/software/component_segmentation/git/dimatr_optimizations$ pytest matrixcomponent/tests.py 
============================================================================================================ test session starts =============================================================================================================
platform linux -- Python 3.7.4, pytest-5.4.1, py-1.8.0, pluggy-0.13.0
rootdir: /mnt/vol1/software/component_segmentation/git/dimatr_optimizations
plugins: openfiles-0.4.0, remotedata-0.3.2, arraydiff-0.3, doctestplus-0.4.0
collected 5 items                                                                                                                                                                                                                            

matrixcomponent/tests.py ...FF                                                                                                                                                                                                         [100%]

================================================================================================================== FAILURES ==================================================================================================================
______________________________________________________________________________________________________________ test_find_groups ______________________________________________________________________________________________________________

    def test_find_groups():
        data = np.array([
            [1, 2], [1, 2], [1, 3],
            [2, 1], [2, 1], [2, 1], [2, 2],
            [3, 3], [3, 3], [3, 4], [3, 4], [3, 5]
        ])
        assert np.array_equal(find_groups(data[:,0], data[:,1]),
                              [(0, 2), (2, 3),
                               (3, 6), (6, 7),
                               (7, 9), (9, 11), (11, 12)])
        data = np.array([[]])
>       assert np.array_equal(find_groups(data[:, 0], data[:, 1]), [])
E       IndexError: index 0 is out of bounds for axis 1 with size 0

matrixcomponent/tests.py:56: IndexError
_______________________________________________________________________________________________________ test_sort_and_drop_duplicates ________________________________________________________________________________________________________

    def test_sort_and_drop_duplicates():
        df = dict({
            "from":       [1, 3, 2, 3, 2, 0, 5, 4, 1, 2],
            "to":         [2, 2, 4, 2, 3, 1, 4, 3, 2, 3],
            "path_index": [0, 3, 1, 2, 2, 3, 2, 1, 3, 2],
        })

        expected = dict({
            "from":       [0, 1, 1, 2, 2, 3, 3, 4, 5],
            "to":         [1, 2, 2, 3, 4, 2, 2, 3, 4],
            "path_index": [3, 0, 3, 2, 1, 2, 3, 1, 2],
        })  # only one duplicate (2, 3, 2)

>       assert np.array_equal(sort_and_drop_duplicates( [np.concatenate( (df['from'], df['to'], df['path_index']) )], expected) )

matrixcomponent/tests.py:75: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

connections = [array([1, 3, 2, 3, 2, 0, 5, 4, 1, 2, 2, 2, 4, 2, 3, 1, 4, 3, 2, 3, 0, 3,
       1, 2, 2, 3, 2, 1, 3, 2])], shift = {'from': [0, 1, 1, 2, 2, 3, ...], 'path_index': [3, 0, 3, 2, 1, 2, ...], 'to': [1, 2, 2, 3, 4, 2, ...]}
path_shift = 10

    def sort_and_drop_duplicates(connections: 'List[np.array]', shift=21, path_shift=10) -> dict:
        '''
        returns connections sorted by ["from", "to", "path_index"] without duplicate entries;
        see find_dividers in segmentation.py
        '''
>       mask = (1 << shift) - 1
E       TypeError: unsupported operand type(s) for <<: 'int' and 'dict'

matrixcomponent/utils.py:85: TypeError
========================================================================================================== short test summary info ===========================================================================================================
FAILED matrixcomponent/tests.py::test_find_groups - IndexError: index 0 is out of bounds for axis 1 with size 0
FAILED matrixcomponent/tests.py::test_sort_and_drop_duplicates - TypeError: unsupported operand type(s) for <<: 'int' and 'dict'
======================================================================================================== 2 failed, 3 passed in 1.53s =========================================================================================================
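
Both failures look like problems in the tests themselves rather than in the functions under test. The sketch below reproduces the two error messages in isolation, using only NumPy and plain Python. It does not call `find_groups` or `sort_and_drop_duplicates`, whose implementations are not shown here, and the `np.empty((0, 2))` replacement is an assumption about what the empty-input case was meant to exercise.

```python
import numpy as np

# Failure 1: np.array([[]]) has one row and zero columns,
# so there is no column 0 to slice.
empty_bad = np.array([[]])
print(empty_bad.shape)  # (1, 0)

try:
    empty_bad[:, 0]
    index_error_msg = None
except IndexError as err:
    index_error_msg = str(err)
print("IndexError:", index_error_msg)

# Assumed alternative: zero rows and two columns, which can be sliced by column.
empty_ok = np.empty((0, 2), dtype=int)
print(empty_ok[:, 0].shape, empty_ok[:, 1].shape)  # (0,) (0,)

# Failure 2: in the test, the closing parenthesis puts `expected` inside the
# call to sort_and_drop_duplicates, so the dict arrives as the `shift` argument
# and `1 << shift` fails.
expected = {"from": [0, 1], "to": [1, 2], "path_index": [3, 0]}  # shortened stand-in
try:
    mask = (1 << expected) - 1
    type_error_msg = None
except TypeError as err:
    type_error_msg = str(err)
print("TypeError:", type_error_msg)
```

If this reading is right, the second test probably needs its closing parenthesis moved so that `expected` is the second argument of `np.array_equal` and not of `sort_and_drop_duplicates`. How the returned value should then be compared with `expected` depends on what the function returns, which the traceback does not show.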

subwaystation commented 4 years ago

I think we have to add psutil to requirements.txt.