openworm / tracker-commons

Compilation of information and code bases related to open-source trackers for C. elegans

Python implementation is slow #55

Closed: Ichoran closed this issue 8 years ago

Ichoran commented 8 years ago

In incidental testing with files based upon XJ30_NaCl500mM4uL6h_10m45x10s40s_Ea.wcon (a 75 MB wcon file), the Scala and Julia implementations complete in a few seconds, while the Python implementation takes a few minutes (~60-90x slower). This poses a barrier to the analysis of large data sets.

If there are easy ways to improve this performance, doing so is probably advisable. If not, it may be worth investigating an alternate implementation for large data sets that wraps one of the other implementations (e.g. the yet-to-be-created C or Rust implementations).

MichaelCurrie commented 8 years ago

Saving a 75 MB wcon file took 20+ minutes on my quad-core machine, which I agree is unacceptably slow.

I will start by profiling _data_segment_as_odict, which is the method that is taking up all the time. I suspect it's because I'm not pre-allocating memory for jagged_array.
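For reference, here is a minimal sketch of the line_profiler workflow used in the comments below. kernprof injects `profile` as a builtin when a script is run with `-l`, so the no-op fallback keeps the module importable in normal runs:

```python
# kernprof injects `profile` as a builtin under `kernprof -l`;
# this fallback makes the decorated code runnable without it too.
try:
    profile
except NameError:
    def profile(func):
        return func

@profile
def _data_segment_as_odict(worm_id, df_segment):
    ...

# Usage (matching the terminal logs later in this thread):
#   $ kernprof -l "WCON demo.py"
#   $ python -m line_profiler "WCON demo.py.lprof"
```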

MichaelCurrie commented 8 years ago
  1. I discovered using mpstat -P ALL 2 while running WCON test.py that only one core gets used. So renting a 32-core instance on AWS is definitely a waste of dollars...
  2. I used line_profiler to discover that the .loc DataFrame accesses in _data_segment_as_odict were creating copies of tables rather than views (see the sketch after the profile below):
File: ../../tracker-commons/src/Python/wcon/wcon_data.py
Function: _data_segment_as_odict at line 613

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   613                                           @profile
   614                                           def _data_segment_as_odict(worm_id, df_segment):
   615                                               """
   616                                               Convert a pandas dataframe into an ordered_dictionary.  This is a support
   617                                               method of data_as_array.
   618
   619                                               """
   620         1           54     54.0      0.0      print("entering _data_segment_as_odict")
   621         1            4      4.0      0.0      start_time = time.monotonic()
   622         1            4      4.0      0.0      gc.disable()
   623         1            2      2.0      0.0      data_segment = [("id", worm_id)]
   624
   625         1            2      2.0      0.0      print("TIMEPOINT 1: %.2f seconds" %
   626         1           26     26.0      0.0            (time.monotonic() - start_time))
   627         1            3      3.0      0.0      start_time = time.monotonic()
   628
   629
   630                                               # We only care about time indices for which we have some "aspect" or
   631                                               # else which have some non-aspect
   632         1         3361   3361.0      0.0      df_segment = df_segment[(~df_segment.isnull()).any(axis=1)]
   633
   634                                               # We must make the array "jagged" according to the "aspect size",
   635                                               # since aspect size may differ frame-by-frame
   636         1          993    993.0      0.0      worm_aspect_size = df_segment.loc[:, ('aspect_size')]
   637
   638         1           28     28.0      0.0      data_segment.append(('t', list(df_segment.index)))
   639
   640         1          350    350.0      0.0      keys_used = [k for k in df_segment.columns.get_level_values('key')
   641                                                            if k != 'aspect_size']
   642
   643         1            3      3.0      0.0      print("TIMEPOINT 2: %.2f seconds" %
   644         1           36     36.0      0.0            (time.monotonic() - start_time))
   645         1            3      3.0      0.0      start_time = time.monotonic()
   646
   647                                               # e.g. ox, oy, head, ventral
   648         1          152    152.0      0.0      for key in [k for k in keys_used if k in elements_without_aspect]:
   649                                                   cur_segment_slice = df_segment.loc[:, idx[key, 0]]
   650                                                   # We must replace NaN with None, otherwise the JSON encoder will
   651                                                   # save 'NaN' as the string and this will get rejected by our schema
   652                                                   # on any subsequent loads
   653                                                   # Note we can't use .fillna(None) due to this issue:
   654                                                   # https://github.com/pydata/pandas/issues/1972
   655                                                   if key in ['head', 'ventral']:
   656                                                       cur_segment_slice = \
   657                                                           cur_segment_slice.where(pd.notnull(cur_segment_slice), None)
   658                                                   cur_list = list(np.array(cur_segment_slice))
   659                                                   data_segment.append((key, cur_list))
   660
   661         1            2      2.0      0.0      print("TIMEPOINT 3: %.2f seconds" %
   662         1           28     28.0      0.0            (time.monotonic() - start_time))
   663         1            3      3.0      0.0      start_time = time.monotonic()
   664
   665         1            1      1.0      0.0      accumulated_time0 = 0
   666         1            1      1.0      0.0      accumulated_time1 = 0
   667         1            1      1.0      0.0      accumulated_time2 = 0
   668         1            2      2.0      0.0      accumulated_time3 = 0
   669
   670                                               # e.g. x, y
   671       503         1104      2.2      0.0      for key in [k for k in keys_used if k in elements_with_aspect]:
   672       502         1119      2.2      0.0          st = time.monotonic()
   673       502       433404    863.4      0.3          non_jagged_array = df_segment.loc[:, (key)]
   674       502         2150      4.3      0.0          accumulated_time0 += time.monotonic() - st
   675
   676       502          998      2.0      0.0          jagged_array = []
   677
   678    101404       248991      2.5      0.2          for t in non_jagged_array.index:
   679                                                       # If aspect size isn't defined, don't bother adding data here:
   680    100902     45335029    449.3     34.0              if np.isnan(worm_aspect_size.loc[t, 0]):
   681                                                           jagged_array.append([])
   682                                                       else:
   683    100902       365646      3.6      0.3                  st = time.monotonic()
   684
   685    100902     44400822    440.0     33.3                  cur_aspect_size = int(worm_aspect_size.loc[t, 0])
   686    100902       403096      4.0      0.3                  accumulated_time1 += time.monotonic() - st
   687                                                           # For some reason loc's slice notation is INCLUSIVE!
   688                                                           # so we must subtract one from cur_aspect_size, so if
   689                                                           # it's 3, for instance, we get only entries
   690                                                           # 0, 1, and 2, as required.
   691    100902       198099      2.0      0.1                  if cur_aspect_size == 0:
   692     27108        53826      2.0      0.0                      cur_entry = []
   693                                                           else:
   694     73794       154050      2.1      0.1                      st = time.monotonic()
   695     73794     36917435    500.3     27.7                      cur_entry = non_jagged_array.loc[t, 0:cur_aspect_size - 1]
   696     73794       297507      4.0      0.2                      accumulated_time2 += time.monotonic() - st
   697     73794       163166      2.2      0.1                      st = time.monotonic()
   698     73794      3883578     52.6      2.9                      cur_entry = list(np.array(cur_entry))
   699     73794       280115      3.8      0.2                      accumulated_time3 += time.monotonic() - st
   700
   701
   702    100902       248648      2.5      0.2                  jagged_array.append(cur_entry)
   703
   704       502         1248      2.5      0.0          data_segment.append((key, jagged_array))
   705
   706         1            2      2.0      0.0      print("TIMEPOINT 4: %.2f seconds" %
   707         1           75     75.0      0.0            (time.monotonic() - start_time))
   708         1            3      3.0      0.0      start_time = time.monotonic()
   709
   710         1            2      2.0      0.0      print("accumulatedtime0: %.2f seconds" %
   711         1           22     22.0      0.0            (accumulated_time0))
   712
   713         1            2      2.0      0.0      print("accumulatedtime1: %.2f seconds" %
   714         1           18     18.0      0.0            (accumulated_time1))
   715
   716         1            2      2.0      0.0      print("accumulatedtime2: %.2f seconds" %
   717         1           15     15.0      0.0            (accumulated_time2))
   718
   719         1            2      2.0      0.0      print("accumulatedtime3: %.2f seconds" %
   720         1           15     15.0      0.0            (accumulated_time3))
   721
   722         1            7      7.0      0.0      gc.enable()
   723
   724         1         2604   2604.0      0.0      return OrderedDict(data_segment)
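The dominant lines above are per-timepoint scalar lookups such as `worm_aspect_size.loc[t, 0]`, each paying full pandas indexing overhead (~450 µs per hit in the profile). A toy comparison of that pattern against pulling the column out once (not the repo's code, just the shape of the problem):

```python
import time

import numpy as np
import pandas as pd

df = pd.DataFrame({0: np.arange(100_000, dtype=float)})

t0 = time.monotonic()
slow = [int(df.loc[t, 0]) for t in df.index]   # one .loc lookup per row
t1 = time.monotonic()
fast = df.loc[:, 0].astype(int).tolist()       # one vectorized conversion
t2 = time.monotonic()

assert slow == fast
print("per-row .loc: %.2f s, vectorized: %.4f s" % (t1 - t0, t2 - t1))
```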
MichaelCurrie commented 8 years ago

Full 4500+ frame file with garbage collection: 1744.49 seconds = ~29 minutes.
Full 4500+ frame file with garbage collection disabled during crucial parts: 1546.59 seconds = ~26 minutes.
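The garbage-collection fencing is just the standard pattern of bracketing the append-heavy section, as in this minimal sketch (with a try/finally added for safety; the real code simply calls gc.disable()/gc.enable() inside _data_segment_as_odict):

```python
import gc

# CPython's cyclic GC can trigger repeatedly while millions of list
# appends allocate new containers; disabling it around the hot section
# avoids that overhead.
gc.disable()
try:
    jagged_array = []
    for t in range(1_000_000):
        jagged_array.append([t])
finally:
    gc.enable()
```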

I profiled with just the first 200 frames so I didn't have to wait so long.

The resulting performance improvements are in commit 16b1bd4126e165fe953c47f70c4555746e7d47a8.

Now profiling the full file is more feasible.

Full file with the above performance improvements to _data_segment_as_odict: 38.53 seconds (a 40x speedup)

New profile:

Total time: 35.8752 s
File: ../../tracker-commons/src/Python/wcon/wcon_data.py
Function: _data_segment_as_odict at line 612

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   612                                           @profile
   613                                           def _data_segment_as_odict(worm_id, df_segment):
   614                                               """
   615                                               Convert a pandas dataframe into an ordered_dictionary.  This is a support
   616                                               method of data_as_array.
   617
   618                                               """
   619                                               # Disable garbage collection to speed up list append operations
   620         1           11     11.0      0.0      gc.disable()
   621         1            2      2.0      0.0      data_segment = [("id", worm_id)]
   622
   623                                               # We only care about time indices for which we have some "aspect" or
   624                                               # else which have some non-aspect
   625         1        42170  42170.0      0.1      df_segment = df_segment[(~df_segment.isnull()).any(axis=1)]
   626
   627                                               # We must make the array "jagged" according to the "aspect size",
   628                                               # since aspect size may differ frame-by-frame
   629         1         1164   1164.0      0.0      worm_aspect_size = df_segment.loc[:, ('aspect_size')]
   630
   631         1          255    255.0      0.0      data_segment.append(('t', list(df_segment.index)))
   632
   633         1          381    381.0      0.0      keys_used = [k for k in df_segment.columns.get_level_values('key')
   634                                                            if k != 'aspect_size']
   635
   636                                               # e.g. ox, oy, head, ventral
   637         1          139    139.0      0.0      for key in [k for k in keys_used if k in elements_without_aspect]:
   638                                                   cur_segment_slice = df_segment.loc[:, idx[key, 0]]
   639                                                   # We must replace NaN with None, otherwise the JSON encoder will
   640                                                   # save 'NaN' as the string and this will get rejected by our schema
   641                                                   # on any subsequent loads
   642                                                   # Note we can't use .fillna(None) due to this issue:
   643                                                   # https://github.com/pydata/pandas/issues/1972
   644                                                   if key in ['head', 'ventral']:
   645                                                       cur_segment_slice = \
   646                                                           cur_segment_slice.where(pd.notnull(cur_segment_slice), None)
   647                                                   cur_list = list(np.array(cur_segment_slice))
   648                                                   data_segment.append((key, cur_list))
   649
   650         1          993    993.0      0.0      num_spine_points = worm_aspect_size.loc[:, 0].astype(int).tolist()
   651
   652                                               # e.g. x, y
   653       503         1376      2.7      0.0      for key in [k for k in keys_used if k in elements_with_aspect]:
   654       502       602137   1199.5      1.7          non_jagged_array = df_segment.loc[:, (key)]
   655
   656       502     19032958  37914.3     53.1          jagged_array = non_jagged_array.values.tolist()
   657
   658       502         5012     10.0      0.0          num_time_points = len(jagged_array)
   659
   660   2330786      3514033      1.5      9.8          for i in range(num_time_points):
   661   2330284      3610922      1.5     10.1              cur_aspect_size = num_spine_points[i]
   662
   663   2330284      9060164      3.9     25.3              jagged_array[i] = jagged_array[i][:cur_aspect_size]
   664
   665       502         3190      6.4      0.0          data_segment.append((key, jagged_array))
   666
   667                                               # Re-enable garbage collection now that all our list append
   668                                               # operations are done.
   669         1           15     15.0      0.0      gc.enable()
   670
   671         1          234    234.0      0.0      return OrderedDict(data_segment)
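The key change visible in this new profile: the per-timepoint `.loc` lookups are gone, replaced by one bulk `.values.tolist()` conversion plus plain list slicing. A toy version of the new inner loop:

```python
import numpy as np
import pandas as pd

non_jagged_array = pd.DataFrame(np.arange(12, dtype=float).reshape(4, 3))
num_spine_points = [3, 1, 0, 2]                   # per-frame aspect sizes

jagged_array = non_jagged_array.values.tolist()   # one bulk conversion
for i in range(len(jagged_array)):
    # Trim each row to that frame's aspect size with cheap list slicing
    jagged_array[i] = jagged_array[i][:num_spine_points[i]]

print(jagged_array)
# [[0.0, 1.0, 2.0], [3.0], [], [9.0, 10.0]]
```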
MichaelCurrie commented 8 years ago

Next steps:

  1. Now try loading this WCON file I've saved. The loading code will likely need the same 40x speedup treatment just given to the saving code.
  2. Consider adding multicore support via simple multithreading (e.g. http://stackoverflow.com/questions/23537037/)
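A minimal sketch of the per-worm fan-out the linked Stack Overflow answer describes (the worker name is hypothetical, standing in for _data_segment_as_odict; whether pickling pandas segments between processes pays off here is untested):

```python
import multiprocessing as mp

def segment_to_odict(args):
    # Hypothetical stand-in for _data_segment_as_odict(worm_id, df_segment)
    worm_id, segment = args
    return (worm_id, segment)

if __name__ == '__main__':
    segments = [(1, [[0.0, 1.0]]), (2, [[2.0, 3.0]])]
    with mp.Pool() as pool:                 # one worker per core by default
        results = pool.map(segment_to_odict, segments)
    print(results)
```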
MichaelCurrie commented 8 years ago

So, reading the whole 4500+ frame file takes 73.18 seconds.

12% of the time is spent in _validate_time_series_data, and 88% in _obtain_time_series_data_frame.

Zooming in on _obtain_time_series_data_frame, sadly nothing stands out as an obvious candidate for optimization (see below).

Instead, I think adding multicore support is probably the best approach.

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   256                                           @profile
   257                                           def parse_data(data):
   258                                               """
   259                                               Parse an array of entries conforming to the WCON schema definition
   260                                               for the "data" array in the root object.  The canonical example is
   261                                               that of worm "skeleton" (midline) information over time.
   262
   263                                               This could be the standard "data" array from the root object, or
   264                                               some custom array that needs to be processed
   265
   266                                               Note that all elements are required to have "id", "t", and "x" and "y"
   267                                               entries.
   268
   269                                               Return an object encapsulating that data.
   270
   271                                               """
   272                                               # If data is single-valued, wrap it in a list so it will be just
   273                                               # a special case of the array case.
   274         2            7      3.5      0.0      if isinstance(data, dict):
   275         1            1      1.0      0.0          data = [data]
   276
   277         2           41     20.5      0.0      data = np.array(data)
   278
   279                                               # We only care about data that have all the mandatory fields.
   280                                               # the Custom Feature Type 2 objects are suppressed by this filter:
   281         2          111     55.5      0.0      is_time_series_mask = get_mask(data, 't')
   282         2           37     18.5      0.0      has_id_mask = get_mask(data, 'id')
   283         2           60     30.0      0.0      data = data[is_time_series_mask & has_id_mask]
   284
   285                                               # Clean up and validate all time-series data segments
   286         2        86335  43167.5     12.0      _validate_time_series_data(data)
   287
   288         2       631829 315914.5     87.9      time_df = _obtain_time_series_data_frame(data)
   289
   290         2            5      2.5      0.0      return time_df

ubuntu@ip-172-31-9-94:~/github/open-worm-analysis-toolbox/examples$ kernprof -l WCON\ demo.py
RUNNING TEST WCON demo.py:
Time elapsed: 38.36 seconds
NOW, let's LOAD this file!
Loading file: testfile2.wcon
Time elapsed: 73.18 seconds
Wrote profile results to WCON demo.py.lprof
ubuntu@ip-172-31-9-94:~/github/open-worm-analysis-toolbox/examples$ python -m line_profiler WCON\ demo.py.lprof
Timer unit: 1e-06 s

Total time: 0.74199 s
File: ../../tracker-commons/src/Python/wcon/wcon_data.py
Function: _obtain_time_series_data_frame at line 291

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   291                                           @profile
   292                                           def _obtain_time_series_data_frame(time_series_data):
   293                                               """
   294                                               Obtain a time-series pandas DataFrame
   295
   296                                               Parameters
   297                                               ------------
   298                                               time_series_data: list
   299                                                   All entries must be lists of dicts, all of which
   300                                                   must have 'id' and 't'
   301
   302                                               Returns
   303                                               ----------
   304                                               pandas dataframe
   305                                                   Time-series data goes into this Pandas DataFrame
   306                                                   The dataframe will have t as index, and multilevel columns
   307                                                   with id at the first level and all other keys at second level.
   308
   309                                               """
   310
   311                                               # Our DataFrame to return
   312         2            9      4.5      0.0      time_df = None
   313
   314                                               # Consider only time-series data stamped with an id:
   315         4           31      7.8      0.0      for data_segment in time_series_data:
   316                                                   # Add this data_segment to a Pandas dataframe
   317         2            6      3.0      0.0          segment_id = data_segment['id']
   318         2           44     22.0      0.0          segment_keys = np.array([k for k in data_segment.keys()
   319                                                                            if k not in ['t', 'id']])
   320
   321         2          827    413.5      0.1          cur_timeframes = np.array(data_segment['t']).astype(float)
   322
   323                                                   # Create our column names as the cartesian product of
   324                                                   # the segment's keys and the id of the segment
   325                                                   # Only elements articulated across the skeleton, etc, of the worm
   326                                                   # have the "aspect" key though
   327                                                   cur_elements_with_aspect = \
   328         2           46     23.0      0.0              [k for k in elements_with_aspect if k in segment_keys]
   329         2            5      2.5      0.0          cur_elements_without_aspect = ['aspect_size'] + \
   330         2           47     23.5      0.0              [k for k in elements_without_aspect if k in segment_keys]
   331
   332                                                   # We want to be able to fit the largest aspect size in our
   333                                                   # DataFrame
   334         2         4575   2287.5      0.6          max_aspect_size = int(max([k[0] for k in data_segment['aspect_size']]))
   335
   336         2           34     17.0      0.0          key_combos = list(itertools.product([segment_id],
   337         2            5      2.5      0.0                                              cur_elements_with_aspect,
   338         2          101     50.5      0.0                                              range(max_aspect_size)))
   339         2            8      4.0      0.0          key_combos.extend(list(itertools.product([segment_id],
   340         2            4      2.0      0.0                                                   cur_elements_without_aspect,
   341         2           10      5.0      0.0                                                   [0])))
   342
   343         2            5      2.5      0.0          column_type_names = ['id', 'key', 'aspect']
   344         2           47     23.5      0.0          cur_columns = pd.MultiIndex.from_tuples(key_combos,
   345         2         5968   2984.0      0.8                                                  names=column_type_names)
   346
   347                                                   # e.g. if this segment has only 'x', 'y', that's what we'll be
   348                                                   # looking to add to our dataframe data staging in the next step
   349         2            6      3.0      0.0          cur_data_keys = cur_elements_with_aspect + \
   350         2            6      3.0      0.0              cur_elements_without_aspect
   351
   352                                                   # We must pad the timeframes where the data doesn't have maximal
   353                                                   # aspect or else the concatenation step below will fail.
   354         6           18      3.0      0.0          for k in cur_elements_with_aspect:
   355     18572        47083      2.5      6.3              for i in range(len(cur_timeframes)):
   356                                                           data_segment[k][i] = (
   357     18568        46742      2.5      6.3                      data_segment[k][i] +
   358     18568       128965      6.9     17.4                      [np.NaN] * (max_aspect_size - len(data_segment[k][i])))
   359
   360                                                   # Stage the data for addition to our DataFrame.
   361                                                   # Shape KxI where K is the number of keys and
   362                                                   #                 I is the number of "aspects"
   363         2            8      4.0      0.0          cur_data = np.array(
   364         2           11      5.5      0.0              [np.concatenate(
   365                                                           [data_segment[k][i] for k in cur_data_keys]
   366         2       347316 173658.0     46.8              ) for i in range(len(cur_timeframes))]
   367                                                   )
   368
   369         2         2274   1137.0      0.3          cur_df = pd.DataFrame(cur_data, columns=cur_columns)
   370
   371         2          358    179.0      0.0          cur_df.index = cur_timeframes
   372         2           20     10.0      0.0          cur_df.index.names = 't'
   373
   374                                                   # We want the index (time) to be in order.
   375         2          358    179.0      0.0          cur_df.sort_index(axis=0, inplace=True)
   376
   377                                                   # Apparently multiindex must be sorted to work properly:
   378         2        67314  33657.0      9.1          cur_df.sort_index(axis=1, inplace=True)
   379
   380         2           22     11.0      0.0          if time_df is None:
   381         2            5      2.5      0.0              time_df = cur_df
   382                                                   else:
   383                                                       time_df = df_upsert(src=cur_df, dest=time_df)
   384
   385                                               # We want the index (time) to be in order.
   386         2           80     40.0      0.0      time_df.sort_index(axis=0, inplace=True)
   387
   388                                               # We have to do this because somehow with entire worms who have 'head' or
   389                                               # 'ventral' columns, all their columns (even the numeric ones)
   390                                               # have dtype = object !  We want dtype=float64 (or some float anyway)
   391                                               # (Ignore the FutureWarning here about convert_objects being deprecated,
   392                                               # because to_numeric only acts on Series and I don't know which ones
   393                                               # we need to convert.)
   394         2          201    100.5      0.0      with warnings.catch_warnings():
   395         2          232    116.0      0.0          warnings.filterwarnings(action="ignore", category=FutureWarning)
   396         2        53569  26784.5      7.2          time_df = time_df.convert_objects(convert_numeric=True)
   397
   398                                               # If 'head' or 'ventral' is NaN, we must specify '?' since
   399                                               # otherwise, when saving this object, to specify "no value" we would
   400                                               # have to create a new data segment with no 'head' column at all.
   401                                               # But since both 'head' and 'ventral' have this issue, we might need
   402                                               # segments with just 'head', just 'ventral', and both 'head' and
   403                                               # 'ventral', and none.  That's too unwieldy for a rare use case.
   404                                               # So instead we treat nan as '?'.
   405
   406                                               # We must replace NaN with None, otherwise the JSON encoder will
   407                                               # save 'NaN' as the string and this will get rejected by our schema
   408                                               # on any subsequent loads
   409                                               # Note we can't use .fillna(None) due to this issue:
   410                                               # https://github.com/pydata/pandas/issues/1972
   411         2          521    260.5      0.1      df_keys = set(time_df.columns.get_level_values('key'))
   412         6           17      2.8      0.0      for k in ['head', 'ventral']:
   413         4           10      2.5      0.0          if k in df_keys:
   414                                                       cur_slice = time_df.loc[:, idx[:, k, :]]
   415                                                       time_df.loc[:, idx[:, k, :]] = cur_slice.fillna(value=np.nan)
   416
   417                                               # Make sure aspect_size is a float, since only floats are nullable:
   418         2            5      2.5      0.0      if 'aspect_size' in df_keys:
   419                                                   time_df.loc[:, idx[:, 'aspect_size', :]] = \
   420         2        35071  17535.5      4.7              time_df.loc[:, idx[:, 'aspect_size', :]].astype(float)
   421
   422         2            6      3.0      0.0      return time_df
Ichoran commented 8 years ago

Is that the best way to use np.concatenate? What is its performance on sequential additions? I'd think you'd instead be after some sort of bulk operation. Also, the NaN-padding is surprisingly expensive. Any idea what fraction of values are NaN in the test file?
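One bulk alternative along those lines (an illustration only, not the repo's code): preallocate a NaN-filled array of shape (frames, max_aspect_size) and copy each row in, rather than extending each per-frame Python list with `[np.NaN] * k`:

```python
import numpy as np

rows = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]      # jagged per-frame data
max_aspect_size = max(len(r) for r in rows)

# One allocation for all the padding, instead of per-frame list appends
padded = np.full((len(rows), max_aspect_size), np.nan)
for i, r in enumerate(rows):
    padded[i, :len(r)] = r
```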

MichaelCurrie commented 8 years ago

@Ichoran : not sure. I'm trying a few other things first; thanks for your comments...

After playing with the multiprocessing module for a few days, it turned out I missed something obvious from the above profiler log:

Time elapsed: 73.18 seconds
Wrote profile results to WCON demo.py.lprof
ubuntu@ip-172-31-9-94:~/github/open-worm-analysis-toolbox/examples$ python -m line_profiler WCON\ demo.py.lprof
Total time: 0.74199 s
...

The function I was profiling actually took just 0.74 seconds out of the 73.18 seconds the whole process was taking. I was optimizing the wrong method.

As it turns out, after further profiling, the culprit, taking 99% of the time, was a call to the jsonschema.validate method. I'm going to create a reproducible example and post it to the jsonschema repo. Validating against the schema is really optional; if we are already sure a file conforms, it is an unnecessary step.
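For context, the expensive call is the standard `jsonschema.validate(instance, schema)`; making it optional looks roughly like this sketch (the schema file name here is an assumption):

```python
import json

import jsonschema

validate_against_schema = False   # skipping this avoids ~99% of the load time

with open('testfile2.wcon') as f:
    wcon_dict = json.load(f)      # plain JSON parsing still always happens

if validate_against_schema:
    with open('wcon_schema.json') as f:   # schema file name assumed here
        schema = json.load(f)
    jsonschema.validate(wcon_dict, schema)
```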

MichaelCurrie commented 8 years ago

I sped up the saving code again (commit cf8051d6fa2db06e921fa8fe3a8bbe9b9078bb56).

It turns out the line

keys_used = [k for k in df_segment.columns.get_level_values('key') 
             if k != 'aspect_size']

... was not returning unique results, so while keys_used = ['x', 'y'] was what I wanted to be the typical result of that line, it was actually ending up with keys_used = ['x', 'x', 'x', 'x', 'x', 'y', 'y', 'y', 'y', 'y'] or something, so the later filtering step was being looped over more times than necessary.
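A toy reproduction of the bug, plus an order-preserving de-duplication of the kind the commit applies (the exact fix may differ in detail):

```python
from collections import OrderedDict

import pandas as pd

# Toy frame with the same ('key', 'aspect') column structure: there is one
# level value per column, so 'x' and 'y' each repeat once per aspect column.
cols = pd.MultiIndex.from_product([['x', 'y'], range(3)],
                                  names=['key', 'aspect'])
df_segment = pd.DataFrame([[0.0] * 6], columns=cols)

print(list(df_segment.columns.get_level_values('key')))
# ['x', 'x', 'x', 'y', 'y', 'y']

# Order-preserving de-duplication:
keys_used = list(OrderedDict.fromkeys(
    k for k in df_segment.columns.get_level_values('key')
    if k != 'aspect_size'))
print(keys_used)   # ['x', 'y']
```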

Now saving the full example file takes 5.38 seconds (an improvement on the previous 38.53 seconds, and a 324x speedup over the original 29 minutes).

Ichoran commented 8 years ago

That's 5.38 seconds for the full 75 MB? That's reasonable! Good work getting it back into the realm of usability!

MichaelCurrie commented 8 years ago

Python performance summary

File used for test: tracker-commons/tests/from_schafer_lab.wcon
Size: 59.3 MB (4641 frames, ~250 points x and ~250 points y per frame)
Zipped size: 5.57 MB

| Instruction | Time (seconds) | jsonschema.validate used? | Cores used |
| --- | --- | --- | --- |
| WCONWorms.load_from_file | 1.00 | No | 1 |
| WCONWorms.load_from_file | 74.29 | Yes | 1 |
| WCONWorms.save_to_file | 5.38 | n/a | 1 |

Information on computer used:

$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
Thread(s) per core:    2
Core(s) per socket:    2
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 62
Stepping:              4
CPU MHz:               2800.062
BogoMIPS:              5600.12
Hypervisor vendor:     Xen
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              25600K
NUMA node0 CPU(s):     0-3
$ uname -a
Linux ip-172-31-9-94 3.13.0-79-generic #123-Ubuntu SMP Fri Feb 19 14:27:58 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
$ python
>>> import sys
>>> sys.version
'3.5.1 |Continuum Analytics, Inc.| (default, Dec  7 2015, 11:16:01) \n[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]'

Benchmark script (the path is corrected to the test file named above; the `pretty` flag's value was not recorded):

import time

from wcon import WCONWorms

JSON_path = '~/github/tracker-commons/tests/from_schafer_lab.wcon'

start_time = time.monotonic()
w1 = WCONWorms.load_from_file(JSON_path,
                              validate_against_schema=False)
print("Time to load w1: " + str(time.monotonic() - start_time))

test_path = 'test.wcon'
pretty = False  # pretty-printing flag; value used in the benchmark not recorded
start_time = time.monotonic()
w1.save_to_file(test_path, pretty_print=pretty)
print("Time to save w1: " + str(time.monotonic() - start_time))
Ichoran commented 8 years ago

Yikes. Can we turn validation off by default, or will that result in badly formed WCON files turning into badly formed data in memory without any good way to tell?

MichaelCurrie commented 8 years ago

I agree. Validation can be turned off, although I have set the default to True, if that's okay. I'm not sure what happens if we pass invalid WCON; it still has to be valid JSON, and that much is always checked.
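To illustrate the distinction with a toy example: with schema validation off, JSON well-formedness is still enforced by the parser, but WCON-level structure is not.

```python
import json

# Malformed JSON is still rejected even with schema validation off:
try:
    json.loads('{"units":')
except json.JSONDecodeError as e:
    print("rejected:", e)

# ...but valid JSON that is invalid WCON parses without complaint,
# even though it lacks the mandatory "units" and "data" keys:
doc = json.loads('{"not_wcon": true}')
print(doc)
```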

Ichoran commented 8 years ago

Can you maybe check what happens with valid JSON that is invalid WCON before you close this?

I'm really rather wary of providing something that takes 10x longer than it needs to, or of providing something that doesn't actually check that the WCON is valid. Some sort of very clear warning is needed in either case. This tends to be the kind of thing that is not obvious to a casual user and the first experience shouldn't be, "Well, I did all this work to switch to WCON and now it takes me 15x longer than with my format...no thanks!"

MichaelCurrie commented 8 years ago

I still haven't checked what happens with some invalid WCON files, but I did create this issue asking the jsonschema authors why validate is so slow:

https://github.com/Julian/jsonschema/issues/277

We can leave this issue open until I can speed up the validation step, or until I've thought more carefully about the situations in which not validating might cause a problem.

MichaelCurrie commented 8 years ago

@Ichoran would you mind please adding the XJ30_NaCl500mM4uL6h_10m45x10s40s_Ea.wcon file you mentioned in your opening comment on this issue to the shared Google Drive data folder? That way it can be pulled down for automatic Travis CI testing.

Thanks

Ichoran commented 8 years ago

It's already on the drive. Feel free to move it around as makes sense for you.

MichaelCurrie commented 8 years ago

Great, thanks. I've now moved it from "example_data/MWT Example Data" to "example_data/WCON", just because that's where the other files are.

Ichoran commented 8 years ago

Resolved by https://github.com/openworm/tracker-commons/pull/77