salzman-lab / SICILIAN

Enhancement step: updating `light_utils.py` to run with current Pandas & Numpy when there are no junctions #26

Open fomightez opened 3 months ago

fomightez commented 3 months ago

Current Pandas (v2.2.2) and NumPy (v2.0.1) don't seem compatible with the current light_utils.py, at least when a tiny aligned dataset contains no splice junctions. (It may also be a problem when the input data has some junctions; I'm still trying to sort that out.)

Running the workflow with a tiny dataset is a valid use case: you may want to check that all the dependencies are installed by quickly stepping through the entire process. (This matters especially because the test data and associated annotation files that were apparently available before may no longer be accessible; see here.)

I have tried updating it; here is what I have. You can see it in action by going here and clicking the 'launch Sicilian in JupyterLab' badge. When the session comes up, step through running all the cells. The final cell of that Jupyter Notebook currently runs the modify_refnames() function from light_utils.py. If you swap in the original light_utils.py, it errors out at line 38 in modify_refnames() with `AttributeError: Can only use .str accessor with string values!`.
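A minimal sketch of how this kind of error can arise, assuming the affected column ends up holding only missing values when no junctions are found (the series below is a hypothetical stand-in, not data from SICILIAN):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a gene-name column with no junction data:
# every value is missing, so pandas infers float64 rather than strings.
gene_col = pd.Series([np.nan, np.nan])

try:
    # The .str accessor is rejected on a float64 column
    gene_col.str.replace("x", "", regex=False)
    raised = False
except AttributeError:
    raised = True

# Casting to str first makes the .str accessor usable whatever dtype was inferred
fixed = gene_col.astype(str).str.replace("x", "", regex=False)
```

One thing to keep in mind with this approach: in the Pandas versions discussed in this issue, the cast turns missing values into the literal string 'nan', so downstream code that checks for missing values may need to account for that.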

The differences are shown here.
All but a couple of the changes explicitly cast to string type before chaining string methods or building a string. The other couple of changes avoid setting on a copy by re-assigning the column back to CI_new instead of using inplace=True.
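The re-assignment pattern can be sketched roughly as follows. The column name and the use of fillna() are assumptions made for illustration; the actual lines in light_utils.py may use a different method.

```python
import pandas as pd

# Hypothetical frame; "geneR1A" is an illustrative column name
CI_new = pd.DataFrame({"geneR1A": ["abc,def", None]})

# Pattern to avoid: calling a method with inplace=True on a selected column,
# e.g. CI_new["geneR1A"].fillna("", inplace=True), which may operate on a
# temporary copy and leave CI_new unchanged.

# Re-assigning the result back to the column updates CI_new itself
CI_new["geneR1A"] = CI_new["geneR1A"].fillna("")
```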

I will admit I am very unsure about my changes to what corresponds to lines 118 and higher in the original light_utils.py. Unfortunately, I haven't yet been able to set up a good test case to fully compare the behavior of the original and my modified lines with real data. But even the earlier lines needed some updating, so I thought prompting a discussion at this point was still worthwhile.

Even with earlier Pandas and NumPy (v1.5.1 and v1.23.5, respectively), I was seeing issues when I ran my simplistic test dataset with the original light_utils.py.

The script I edited also runs without error with those older versions of Pandas and NumPy. It only shows a warning about the default value of regex, which is moot since I specify the value of the regex argument. Here is that part of the run, isolated to show the warning:

```
started modify 0.007201433181762695
/home/jovyan/SICILIAN-binder/scripts/light_utils.py:53: FutureWarning: The default value of regex will change from True to False in a future version.
  CI_new.loc[ind,"geneR1" + suff] = CI_new.loc[ind,"geneR1" + suff].astype(str).str.replace("{}[^,]*[,]".format(weird_gene),"",regex=True).str.replace(",{}.*".format(weird_gene),"")  # added cast to str based on https://stackoverflow.com/a/52065957/8508004
ended modify 0.22089719772338867
```
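In the line quoted in the warning, only the first of the two chained .str.replace() calls passes regex explicitly, which is presumably what triggers the warning under older Pandas. A minimal sketch of passing regex=True to both calls, using a hypothetical value for weird_gene and made-up gene strings:

```python
import pandas as pd

# Hypothetical values for illustration; not taken from SICILIAN data
weird_gene = "unknown"
genes = pd.Series(["unknown_1,TP53", "TP53,unknown_2", "BRCA1"])

cleaned = (
    genes.astype(str)
    # drop a weird gene name that is followed by a comma
    .str.replace("{}[^,]*[,]".format(weird_gene), "", regex=True)
    # drop a trailing weird gene name, with regex stated explicitly here as well
    .str.replace(",{}.*".format(weird_gene), "", regex=True)
)
```

Stating regex on every call keeps the behavior the same whether the installed Pandas defaults to True or False.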

(Practical reminder to myself: I was able to test the original light_utils.py with older Pandas and NumPy, in conjunction with the simplistic demo, by launching a Jupyter session from here, cloning in my repo, and then replacing the altered light_utils.py with the original. [I also had to install pyarrow and pysam, and install STAR aligner 2.7.11 with `%conda install -y bioconda::star=2.7.11`.] That Binder example repo currently has older Pandas and NumPy, v1.5.1 and v1.23.5, respectively.)