logix-project / logix

AI Logging for Interpretability and Explainability🔬
Apache License 2.0

Optimization for compute_influence part #95

Closed eatpk closed 4 months ago

eatpk commented 5 months ago

Summary

Optimization for the influence function tgt indexer when running with 10 src_ids and 10,000,000 tgt_ids.

This change makes the operation roughly twice as fast.

Note that since pandas is built on top of NumPy, it is faster than a native Python list [] or even a dictionary {}. I think we could optimize further by dropping down to C-level NumPy operations only.

I also tested a {} dictionary lookup: it was slower than influence_scores.index.get_loc(src_id) once you account for building the dictionary map.
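A rough benchmark sketch of the comparison described above (not the library's code), timing a per-id get_loc loop against a single vectorized get_indexer call; sizes are scaled down from the 10,000,000 tgt_ids mentioned in the summary:

```python
import time

import numpy as np
import pandas as pd

# Scaled-down stand-in for the influence-score DataFrame.
tgt_ids = [f"tgt_{i}" for i in range(200_000)]
scores = pd.DataFrame(np.zeros((1, len(tgt_ids)), dtype=np.float32),
                      index=["src_0"], columns=tgt_ids)
lookup = tgt_ids[:50_000]

# Per-id lookup: one get_loc call per tgt_id.
t0 = time.perf_counter()
loop_indices = [scores.columns.get_loc(t) for t in lookup]
t_loop = time.perf_counter() - t0

# Vectorized lookup: one get_indexer call for all tgt_ids at once.
t0 = time.perf_counter()
vec_indices = scores.columns.get_indexer(lookup)
t_vec = time.perf_counter() - t0

assert loop_indices == list(vec_indices)  # same positions either way
print(f"get_loc loop: {t_loop:.4f}s  get_indexer: {t_vec:.4f}s")
```

The exact speedup depends on the machine and sizes; the linked Colab notebook reports roughly 2x on the full-size case.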

Related Issues

Stated above.

Test Plan

https://colab.research.google.com/drive/18_M8zA8RB-CDZ7jfUOFELS3DK35yaMlE#scrollTo=GNVYnz-Uo2VS The operation became roughly twice as fast.

eatpk commented 5 months ago

I also looked into Polars, but it does not support indexing, which doesn't fit our use case.

eatpk commented 5 months ago

@sangkeun00

sangkeun00 commented 4 months ago

@eatpk Sorry for the late response. I am currently redesigning InfluenceFunction and have made it directly output influence results as a Python dictionary of the form {"src_ids": List[str], "tgt_ids": List[str], "influence": 2D-Tensor}, avoiding the pandas DataFrame bottleneck while keeping some structure in the returned output. Therefore, I am inclined to close this PR, but wanted to check with you first. Let me know your thoughts.
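A minimal sketch of the proposed return format; only the keys {"src_ids", "tgt_ids", "influence"} come from the comment above, and a NumPy array stands in here for the 2D torch.Tensor:

```python
import numpy as np

# Illustrative construction of the proposed dictionary output.
result = {
    "src_ids": ["src_0", "src_1"],
    "tgt_ids": ["tgt_0", "tgt_1", "tgt_2"],
    "influence": np.zeros((2, 3)),  # shape (len(src_ids), len(tgt_ids))
}

# Appending a new batch of targets is a cheap list extend plus one
# concatenate, with no DataFrame reindexing involved.
result["tgt_ids"].extend(["tgt_3"])
result["influence"] = np.concatenate(
    [result["influence"], np.ones((2, 1))], axis=1)
```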

eatpk commented 4 months ago

@sangkeun00 I think I misled you in the previous conversation. Pandas is faster than a native Python list because it is implemented on top of NumPy (which utilizes C/C++). I have verified experimentally that pandas is faster; we could make it faster still by using a NumPy array directly.

sangkeun00 commented 4 months ago

Can you provide more information on that (with a reference if possible)? Previously, the main bottleneck was reindexing columns and rows every time compute_influence is called; specifically, the runtime of that operation increases linearly with every iteration. With the current implementation, however, it only requires appending to the existing list, which is pretty fast.

eatpk commented 4 months ago

If this can be done without reindexing, then that approach works. However, referring here, https://colab.research.google.com/drive/18_M8zA8RB-CDZ7jfUOFELS3DK35yaMlE#scrollTo=GNVYnz-Uo2VS

the main bottleneck of this operation was

  src_indices = [
      influence_scores.index.get_loc(src_id) for src_id in src_ids
  ]
  tgt_indices = [
      influence_scores.columns.get_loc(tgt_id) for tgt_id in tgt_ids
  ]

this part. If we use pandas' get_indexer instead, it is about twice as fast.
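A sketch of the vectorized replacement, assuming influence_scores is a pandas DataFrame indexed by src_id with one column per tgt_id (the tiny DataFrame here is illustrative):

```python
import numpy as np
import pandas as pd

influence_scores = pd.DataFrame(
    np.arange(6, dtype=float).reshape(2, 3),
    index=["s0", "s1"],
    columns=["t0", "t1", "t2"],
)
src_ids, tgt_ids = ["s1"], ["t2", "t0"]

# One vectorized get_indexer call per axis replaces the per-id
# get_loc loops shown above.
src_indices = influence_scores.index.get_indexer(src_ids)
tgt_indices = influence_scores.columns.get_indexer(tgt_ids)

# Select the requested rows and columns in one fancy-indexing step.
selected = influence_scores.to_numpy()[np.ix_(src_indices, tgt_indices)]
print(selected)  # row s1 restricted to columns t2, t0 -> [[5. 3.]]
```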

eatpk commented 4 months ago

From the printed logs of the first code block, you can see that re-indexing took only about 1.5 s, while the c3 part, which is the indexer part, took most of the time:

c1 1.5796821117401123
c2 0.00014448165893554688
c3 26.81981635093689
c4 1.6497035026550293

sangkeun00 commented 4 months ago

Oh right. I apologize, I overloaded the term "reindexing" for all pandas index-related operations. As you can see, the runtime of get_loc increases linearly with the number of columns. If get_loc is called every time compute_influence is called, the total time complexity becomes O(n^2), which should be avoided. One option is to accumulate influence scores as I suggested, and only at the end of compute_influence_all put them into a pandas DataFrame. However, this is a pretty straightforward operation, and I am not sure whether we really need to support it as part of the library, or should just remove pandas from our dependencies.
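A sketch of the accumulate-then-build pattern suggested above: append ids and score blocks on each call, and construct the indexed DataFrame once at the end. The function and variable names here are illustrative, not the library's API:

```python
import numpy as np
import pandas as pd

src_ids, tgt_ids, blocks = ["s0", "s1"], [], []

def compute_influence(batch_tgt_ids, batch_scores):
    # O(1) amortized appends per call; no per-call reindexing or get_loc.
    tgt_ids.extend(batch_tgt_ids)
    blocks.append(batch_scores)

compute_influence(["t0", "t1"], np.array([[1.0, 2.0], [3.0, 4.0]]))
compute_influence(["t2"], np.array([[5.0], [6.0]]))

# Build the indexed DataFrame a single time, at the very end.
df = pd.DataFrame(np.concatenate(blocks, axis=1),
                  index=src_ids, columns=tgt_ids)
```

This keeps each compute_influence call cheap and pays the pandas index-construction cost exactly once.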

Also, you mentioned that NumPy arrays are faster than Python lists, but the lists in my implementation are only for src_ids and tgt_ids, which correspond to rows and columns. For the influence scores themselves, we still use torch.Tensor.

eatpk commented 4 months ago

Ahhhhh okay. I understand now.

Ah, first things first: I am good with closing the PR! I will close it myself. (Sorry I missed this sentence last night.)

Second, I misunderstood what you meant there. I think lists are fine since they only hold the src_ids. Hit me with the PR when you implement it, and I will review and test! @sangkeun00