raphael-group / MERLIN

7 stars 0 forks source link

Possible improvements/suggestions #1

Closed Jay-Leung closed 3 months ago

Jay-Leung commented 4 months ago

Hello there,

Thanks for this method - it's definitely useful to help infer cell lineage and mito clonal evolution at the same time. I had some single cell ATAC & mito data and wanted to try this out. However, there were quite a number of issues while I was trying to run MERLIN.

  1. Gurobi academic license The first error I faced was that the number of optimizations exceeded the allowed usage for gurobipy. It required a license which I eventually managed to get from gurobi. Might be worth mentioning that this is required?

  2. Package versions When I tried to run MERLIN again with the liceense the next error I had was something like

    Traceback (most recent call last):
    File "/home/users/astar/gis/leungjy1/Github/MERLIN/src/merlin_edit.py", line 304, in <module>
    main()
    File "/home/users/astar/gis/leungjy1/Github/MERLIN/src/merlin_edit.py", line 302, in main
    set_ILP_formuation(G_ancestry, cell2group, df_variant, df_total, args.out)
    File "/home/users/astar/gis/leungjy1/Github/MERLIN/src/merlin_edit.py", line 278, in set_ILP_formuation
    df_u = pd.DataFrame(np.reshape(model.getAttr('x', u).values(), (ncells, nmutations)),
    File "/home/users/astar/gis/leungjy1/.conda/envs/mquad/lib/python3.10/site-packages/numpy/_core/fromnumeric.py", line 299, in reshape
    return _wrapfunc(a, 'reshape', newshape, order=order)
    File "/home/users/astar/gis/leungjy1/.conda/envs/mquad/lib/python3.10/site-packages/numpy/_core/fromnumeric.py", line 54, in _wrapfunc
    return _wrapit(obj, method, *args, **kwds)
    File "/home/users/astar/gis/leungjy1/.conda/envs/mquad/lib/python3.10/site-packages/numpy/_core/fromnumeric.py", line 46, in _wrapit
    result = getattr(arr, method)(*args, **kwds)
    ValueError: cannot reshape array of size 1 into shape (50,5)

I solved this by editing the merlin.py lines 240 to 245 to:

    a_values = np.array(list(model.getAttr('x', a).values()))
    logging.info(f'a_values length: {len(a_values)}')
    logging.info(f'a_values: {a_values[:10]}')  # Log the first 10 values for inspection

    if len(a_values) == ngroups * nmutations:
        df_a = pd.DataFrame(np.reshape(a_values, (ngroups, nmutations)).astype(int),
                            columns=df_cluster_total.index, index=df_cluster_total.columns)
        df_a.to_csv(f'{prefix}_Amatrix.csv')

    else:
        logging.error(f'Error reshaping array: cannot reshape array of size {len(a_values)} into shape ({ngroups},{nmutations})')

    u_values = np.array(list(model.getAttr('x', u).values()))

    logging.info(f'u_values length: {len(u_values)}')
    logging.info(f'u_values: {u_values[:10]}')  # Log the first 10 values for inspection

    if len(u_values) == ncells * nmutations:
        df_u = pd.DataFrame(np.reshape(u_values, (ncells, nmutations)),
                            columns=df_cluster_total.index, index=df_cluster_total.columns)
        df_u.to_csv(f'{prefix}_Umatrix.csv')

    else:
        logging.error(f'Error reshaping array: cannot reshape array of size {len(u_values)} into shape ({ncells},{nmutations})')

I am not sure whether this is because the installed versions of numpy differs. But I thought perhaps it might be better to document the package versions used.

I am using python 3.10,

numpy 2.0.1
pandas 2.2.2
netwrokx 3.3
  1. Example data & MKN matrixes

When i ran the example data it worked well. I did note that the format was cell by mutation csv, which the code transposes to a mutation by cell before downstream ancestry graph and ILP. I ran the same code on the MKN data to see if it works on a larger real life dataset, but realised that the output was not correct. My bad for not checking the format, it was because the data was arranged in a mutation by cell csv instead. So the transposition had to be removed, but I think it may be better if the input data was standardized.

Also I am in the midst of running MERLIN on MKN data but it timed out - think it might be worth documenting that the TimeLimit parameter should be adjusted for the ILP step?

Thanks a lot for this method once again, just dropping my suggestions. Do correct me if I have done anything wrong!

Regards, Jay

Jay-Leung commented 4 months ago

Another error that occured was division by 0 if the dot product of vaf1 and vaf1 was = 0.

I added in these lines into merlin.py after line 74:

        if np.dot(vaf1, vaf1) == 0:
            logging.warning(f"Dot product of vaf1 is zero for mutations {mut1} and {mut2}. Skipping these mutations.")
            continue
ViolaChenYT commented 3 months ago

thank you for your numerous great suggestions! I have updated and incorporated those, and I'll look into the issue you raised in the other github issue and get back to you.