jbloomlab / dmslogo

draw sequence logos tailored to deep mutational scanning (DMS) data
GNU General Public License v3.0

Error "not sequential unbroken integers" for Line plot #20

Open yeli7068 opened 2 years ago

yeli7068 commented 2 years ago

Dear Dr. Bloom,

I tried the line plot in dmslogo with toydata.csv, but it fails with the error "not sequential unbroken integers".

Then I turned to the example. Even after reading the instructions, I was still confused, especially because there is a gap between the original and new numbering in BG505_to_HXB2.csv (e.g. site: 141, 142l isite:142, 151).

What does "not sequential unbroken integers" mean? And how do I get the isite for SARS2?

Thx in advance.

Code here:

import pandas as pd
import dmslogo

# load data
toydata = pd.read_csv("toydata.csv")

# logo plot check 
fig, ax = dmslogo.draw_logo(toydata.query('show_site'),
                            x_col='site',
                            letter_col='mutation',
                            letter_height_col='escape_score',
                            xtick_col='wt_site',
                            title='AZD8895',
                            addbreaks=False)

# line plot failed

fig, ax = dmslogo.draw_line(toydata,
                            x_col='site', # how to get the isite in SARS2?what is "not sequential unbroken integers"?
                            height_col='tot_escape_score',
                            xtick_col='site',
                            show_col='show_site',
                            title='AZD8895',
                            widthscale=2)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/var/folders/mv/v7pv40mn6d3gwx8g563lpclm0000gn/T/ipykernel_14414/3092297124.py in <module>
----> 1 fig, ax = dmslogo.draw_line(toydata,
      2                             x_col='site', # how to get the isite in SARS2?what is "sequential unbroken integers"?
      3                             height_col='tot_escape_score',
      4                             xtick_col='site',
      5                             show_col='show_site',

~/anaconda3/envs/SARS2_RBD_Ab_escape_maps/lib/python3.8/site-packages/dmslogo/line.py in draw_line(data, x_col, height_col, height_col2, xtick_col, show_col, xlabel, ylabel, title, color, color2, show_color, linewidth, widthscale, heightscale, axisfontscale, hide_axis, ax, ylim_setter, fixed_ymin, fixed_ymax)
    162     if (xlen != data[x_col].nunique()) or any(list(range(xmin, xmax + 1)) !=
    163                                               data[x_col].unique()):
--> 164         raise ValueError('`x_col` not sequential unbroken integers')
    165 
    166     if len(data[x_col]) != len(data[x_col].unique()):

ValueError: `x_col` not sequential unbroken integers

OS: macOS Catalina 10.15.7
Python: 3.8.12
dmslogo: 0.6.2

jbloom commented 2 years ago

The line plot requires x_col to have sequential unbroken numbers, because the line plot draws a value for every site. The logo plot does not require this because it can break the axis to just show certain sites of interest.
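To see whether a column meets this requirement before plotting, a small check along the following lines may help. This is only a sketch: `is_sequential_unbroken` is a hypothetical helper written for illustration, not part of dmslogo, and it assumes the column holds integer-like values.

```python
import pandas as pd


def is_sequential_unbroken(values) -> bool:
    """Hypothetical helper (not part of dmslogo).

    Returns True if the unique values, in order of first appearance,
    are consecutive integers such as 331, 332, 333, ...
    """
    # Unique values in order of first appearance
    unique = pd.Series(values).drop_duplicates().tolist()
    if not unique:
        return False
    # The run of integers we would expect if there were no gaps
    expected = list(range(int(min(unique)), int(max(unique)) + 1))
    return unique == expected


# Made-up site numbers for illustration
print(is_sequential_unbroken([331, 332, 333]))  # no gap
print(is_sequential_unbroken([331, 332, 335]))  # gap after 332
```

If the check returns False for the column passed as `x_col`, the column either has a gap or is out of order.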

The x_col (or isite) column can be any index that goes 1, 2, 3, and so on. If you are using a protein that is already numbered that way, then it is just the site. But some proteins are not sequentially numbered. For instance, Omicron has some indels in the NTD but is still normally numbered using Wuhan-Hu-1 site numbering.
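One way to build such an index with pandas is sketched below. It assumes the rows are already in sequence order; `add_isite` and the toy values are hypothetical and only meant to illustrate the idea.

```python
import pandas as pd


def add_isite(df: pd.DataFrame, site_col: str = "site",
              isite_col: str = "isite") -> pd.DataFrame:
    """Hypothetical helper: add a 1, 2, 3, ... index for each distinct site.

    Assumes rows are already sorted in sequence order, so the order of
    first appearance of each site label is the order along the protein.
    """
    # Distinct site labels in order of first appearance
    ordered_sites = df[site_col].drop_duplicates().tolist()
    # Map each label to its 1-based position
    isite_map = {site: i for i, site in enumerate(ordered_sites, start=1)}
    out = df.copy()
    out[isite_col] = out[site_col].map(isite_map)
    return out


# Made-up data with a gap in the reference numbering
toy = pd.DataFrame({
    "site": ["141", "142", "151", "152"],
    "tot_escape_score": [0.1, 0.4, 0.2, 0.3],
})
toy = add_isite(toy)
print(toy)
```

The new column could then be passed as `x_col='isite'` to `draw_line`, keeping `xtick_col='site'` so the axis is still labeled with the reference numbering.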