pydata / sparse

Sparse multi-dimensional arrays for the PyData ecosystem
https://sparse.pydata.org
BSD 3-Clause "New" or "Revised" License
585 stars 125 forks source link

Reorganize coo.py? #112

Closed hameerabbasi closed 6 years ago

hameerabbasi commented 6 years ago

coo.py is getting large and unweildly. I sometimes spend a lot of time scrolling through it or finding stuff in it. I was thinking of breaking it up into smaller chunks for better maintainability.

I propose the following:

I love any proposed changes to this structure.

hameerabbasi commented 6 years ago

cc @mrocklin Feedback welcome on what the reorganization should be.

mrocklin commented 6 years ago

There are costs both ways. I've found that too many small files can also be a problem, both for navigability:

where was that function? in core.py or slicing.py?

and for imports:

wait, I now realize that I want some code from slicing.py in a module on which slicing.py depends. Crap. I guess I should move that function to a separate util.py from which they both import. Oh but then I need to change all of the current imports in the other files that already import it.

That being said, searching through multi-thousand-line files also isn't fun, especially for those of us that don't use more sophisticated IDEs. And splitting may make sense. I think that when to split and what to split is subjective. If I were doing this I might pull off the elemwise broadcasting logic (that seems big enough) but leave the other things

Move large implementations out of the COO class into methods

I hesitate to convert methods to functions just for the sake of code organization. This makes inspecting source code more challenging. I find myself frustrated when I look at the source code of objects from other projects using tools like IPython's ?? operator and learn that I now also need to navigate their file layout to get to the lines I care about.

hameerabbasi commented 6 years ago

I see. Agree that the other logic is not (yet) unwieldy enough to justify splitting, but it might be when we get to CSD (which will be a generalization of COO, CSC and CSR). I also plan to work on elemwise for DOK, and have alternative two implementions for elemwise for expert users (the other one being what @shoyer proposed). See my repo for more information. I plan to put it into SPEP format on the weekend.

mrocklin commented 6 years ago

To be clear, I'm not against splitting. My goal in the comment above is just to share feedback on cases where I learned that splitting introduced problems that I didn't expect.

hameerabbasi commented 6 years ago

I guess when something exceeds 250-ish lines should be a good idea?

mrocklin commented 6 years ago

Personally I'm very comfortable editing files that are greater than 250 lines. I think that this number tends to depend on how comfortable you are searching with a text editor or IDE. My comfort level is probably closer to 2000 lines in a file than 250. Looking at a couple other projects, here are some line counts

Pandas, maybe more monolithic

mrocklin@carbon:~/workspace/pandas/pandas/core$ wc -l *.py
    131 accessor.py
   1554 algorithms.py
     81 api.py
   1164 base.py
   2363 categorical.py
    701 common.py
    481 config_init.py
    838 config.py
     51 datetools.py
   6452 frame.py
   7292 generic.py
   4649 groupby.py
   2208 indexing.py
      3 index.py
      0 __init__.py
   5654 internals.py
    700 missing.py
    825 nanops.py
   1432 ops.py
     99 panel4d.py
    132 panelnd.py
   1633 panel.py
   1419 resample.py
   3243 series.py
    485 sorting.py
   1923 strings.py
   2089 window.py
  47602 total

Scikit-Learn, more hierarchical

mrocklin@carbon:~/workspace/scikit-learn/sklearn$ wc -l */*.py
     84 _build_utils/__init__.py
     46 __check_build/__init__.py
     18 __check_build/setup.py
    394 cluster/affinity_propagation_.py
    508 cluster/bicluster.py
    638 cluster/birch.py
    317 cluster/dbscan_.py
     81 cluster/_feature_agglomeration.py
    937 cluster/hierarchical.py
     36 cluster/__init__.py
   1597 cluster/k_means_.py
    422 cluster/mean_shift_.py
     54 cluster/setup.py
    482 cluster/spectral.py
    218 covariance/elliptic_envelope.py
    288 covariance/empirical_covariance_.py
    709 covariance/graph_lasso_.py
     34 covariance/__init__.py
    739 covariance/robust_covariance.py
    560 covariance/shrunk_covariance_.py
    107 cross_decomposition/cca_.py
      2 cross_decomposition/__init__.py
    881 cross_decomposition/pls_.py
    899 datasets/base.py
    138 datasets/california_housing.py
    123 datasets/covtype.py
    102 datasets/__init__.py
    380 datasets/kddcup99.py
    503 datasets/lfw.py
    114 datasets/mlcomp.py
    252 datasets/mldata.py
    147 datasets/olivetti_faces.py
    275 datasets/rcv1.py
   1714 datasets/samples_generator.py
     22 datasets/setup.py
    272 datasets/species_distributions.py
    481 datasets/svmlight_format.py
    372 datasets/twenty_newsgroups.py
    160 decomposition/base.py
   1327 decomposition/dict_learning.py
    349 decomposition/factor_analysis.py
    594 decomposition/fastica_.py
    282 decomposition/incremental_pca.py
     40 decomposition/__init__.py
    309 decomposition/kernel_pca.py
   1305 decomposition/nmf.py
    818 decomposition/online_lda.py
    838 decomposition/pca.py
     29 decomposition/setup.py
    304 decomposition/sparse_pca.py
    229 decomposition/truncated_svd.py
    997 ensemble/bagging.py
    161 ensemble/base.py
   1952 ensemble/forest.py
   2063 ensemble/gradient_boosting.py
    401 ensemble/iforest.py
     35 ensemble/__init__.py
    395 ensemble/partial_dependence.py
     17 ensemble/setup.py
    351 ensemble/voting_classifier.py
   1124 ensemble/weight_boosting.py
      7 externals/conftest.py
    815 externals/funcsigs.py
      5 externals/__init__.py
    498 externals/_pilutil.py
      9 externals/setup.py
    577 externals/six.py
    365 feature_extraction/dict_vectorizer.py
    172 feature_extraction/hashing.py
    519 feature_extraction/image.py
     13 feature_extraction/__init__.py
     19 feature_extraction/setup.py
     45 feature_extraction/stop_words.py
   1424 feature_extraction/text.py
    122 feature_selection/base.py
    201 feature_selection/from_model.py
     43 feature_selection/__init__.py
    450 feature_selection/mutual_info_.py
    473 feature_selection/rfe.py
    753 feature_selection/univariate_selection.py
     82 feature_selection/variance_threshold.py
    297 gaussian_process/correlation_models.py
    882 gaussian_process/gaussian_process.py
    741 gaussian_process/gpc.py
    473 gaussian_process/gpr.py
     23 gaussian_process/__init__.py
   1864 gaussian_process/kernels.py
     96 gaussian_process/regression_models.py
    571 linear_model/base.py
    566 linear_model/bayes.py
   2227 linear_model/coordinate_descent.py
    285 linear_model/huber.py
     86 linear_model/__init__.py
   1527 linear_model/least_angle.py
   1794 linear_model/logistic.py
    881 linear_model/omp.py
    422 linear_model/passive_aggressive.py
    132 linear_model/perceptron.py
    663 linear_model/randomized_l1.py
    504 linear_model/ransac.py
   1382 linear_model/ridge.py
    347 linear_model/sag.py
     48 linear_model/setup.py
   1363 linear_model/stochastic_gradient.py
    389 linear_model/theil_sen.py
     12 manifold/__init__.py
    221 manifold/isomap.py
    713 manifold/locally_linear.py
    431 manifold/mds.py
     37 manifold/setup.py
    539 manifold/spectral_embedding_.py
    897 manifold/t_sne.py
    124 metrics/base.py
   1996 metrics/classification.py
    123 metrics/__init__.py
   1409 metrics/pairwise.py
    828 metrics/ranking.py
    575 metrics/regression.py
    569 metrics/scorer.py
     32 metrics/setup.py
    503 mixture/base.py
    784 mixture/bayesian_mixture.py
    869 mixture/dpgmm.py
    749 mixture/gaussian_mixture.py
    853 mixture/gmm.py
     22 mixture/__init__.py
     59 model_selection/__init__.py
   1403 model_selection/_search.py
   2095 model_selection/_split.py
   1346 model_selection/_validation.py
    551 neighbors/approximate.py
    801 neighbors/base.py
    392 neighbors/classification.py
    178 neighbors/graph.py
     31 neighbors/__init__.py
    219 neighbors/kde.py
    352 neighbors/lof.py
    194 neighbors/nearest_centroid.py
    323 neighbors/regression.py
     41 neighbors/setup.py
    124 neighbors/unsupervised.py
    252 neural_network/_base.py
     15 neural_network/__init__.py
   1328 neural_network/multilayer_perceptron.py
    365 neural_network/rbm.py
    266 neural_network/_stochastic_optimizers.py
   3162 preprocessing/data.py
    184 preprocessing/_function_transformer.py
    379 preprocessing/imputation.py
     68 preprocessing/__init__.py
    837 preprocessing/label.py
    222 preprocessing/_target.py
     10 semi_supervised/__init__.py
    528 semi_supervised/label_propagation.py
    896 svm/base.py
     73 svm/bounds.py
   1160 svm/classes.py
     28 svm/__init__.py
     81 svm/setup.py
      0 tests/__init__.py
    452 tests/test_base.py
    306 tests/test_calibration.py
     14 tests/test_check_build.py
    160 tests/test_common.py
     68 tests/test_config.py
   1252 tests/test_cross_validation.py
    367 tests/test_discriminant_analysis.py
    166 tests/test_docstring_parameters.py
    622 tests/test_dummy.py
    815 tests/test_grid_search.py
    365 tests/test_impute.py
     20 tests/test_init.py
    463 tests/test_isotonic.py
    257 tests/test_kernel_approximation.py
     85 tests/test_kernel_ridge.py
    312 tests/test_learning_curve.py
    140 tests/test_metaestimators.py
    753 tests/test_multiclass.py
    493 tests/test_multioutput.py
    712 tests/test_naive_bayes.py
    986 tests/test_pipeline.py
    354 tests/test_random_projection.py
    477 tree/export.py
     13 tree/__init__.py
     39 tree/setup.py
   1465 tree/tree.py
     23 utils/arpack.py
     17 utils/bench.py
    178 utils/class_weight.py
    134 utils/deprecation.py
   2158 utils/estimator_checks.py
    766 utils/extmath.py
    297 utils/fixes.py
     84 utils/graph.py
    508 utils/__init__.py
    282 utils/linear_assignment_.py
    207 utils/metaestimators.py
     86 utils/mocking.py
    448 utils/multiclass.py
    204 utils/optimize.py
    199 utils/random.py
    508 utils/_scipy_sparse_lsqr_backport.py
     84 utils/setup.py
    470 utils/sparsefuncs.py
     25 utils/stats.py
    904 utils/testing.py
    224 utils/_unittest_backport.py
    868 utils/validation.py
 103260 total
hameerabbasi commented 6 years ago

Huh. I guess I'll follow your advice on splitting off elemwise but keep everything else together. I should invest some time in learning the navigation functions in PyCharm. Apparently, they're great. I can press keyboard shortcuts to navigate to definition, search for symbols; refactoring is great too.

Also, I meant a specific piece of functionality exceeding 250 (or 500) lines is when we should split it off. Or maybe the criteria should be when a main file exceeds 2000 lines, we should split off the biggest offender.

mrocklin commented 6 years ago

Ah, I see. I misinterpreted your comment as trying to keep file size below 250, which seemed small to me. I have no numeric guidelines on line length. Personally I usually make decisions based more on the logic of the function rather than the number of lines.