Closed hameerabbasi closed 6 years ago
cc @mrocklin Feedback welcome on what the reorganization should be.
There are costs both ways. I've found that too many small files can also be a problem, both for navigability:
where was that function? in core.py or slicing.py?
and for imports:
wait, I now realize that I want some code from slicing.py in a module on which slicing.py depends. Crap. I guess I should move that function to a separate util.py from which they both import. Oh but then I need to change all of the current imports in the other files that already import it.
That being said, searching through multi-thousand-line files also isn't fun, especially for those of us that don't use more sophisticated IDEs. And splitting may make sense. I think that when to split and what to split is subjective. If I were doing this I might pull off the elemwise broadcasting logic (that seems big enough) but leave the other things
Move large implementations out of the COO class into methods
I hesitate to convert methods to functions just for the sake of code organization. This makes inspecting source code more challenging. I find myself frustrated when I look at the source code of objects from other projects using tools like IPython's ??
operator and learn that I now also need to navigate their file layout to get to the lines I care about.
I see. Agree that the other logic is not (yet) unwieldy enough to justify splitting, but it might be when we get to CSD
(which will be a generalization of COO
, CSC
and CSR
). I also plan to work on elemwise
for DOK
, and have alternative two implementions for elemwise
for expert users (the other one being what @shoyer proposed). See my repo for more information. I plan to put it into SPEP format on the weekend.
To be clear, I'm not against splitting. My goal in the comment above is just to share feedback on cases where I learned that splitting introduced problems that I didn't expect.
I guess when something exceeds 250-ish lines should be a good idea?
Personally I'm very comfortable editing files that are greater than 250 lines. I think that this number tends to depend on how comfortable you are searching with a text editor or IDE. My comfort level is probably closer to 2000 lines in a file than 250. Looking at a couple other projects, here are some line counts
mrocklin@carbon:~/workspace/pandas/pandas/core$ wc -l *.py
131 accessor.py
1554 algorithms.py
81 api.py
1164 base.py
2363 categorical.py
701 common.py
481 config_init.py
838 config.py
51 datetools.py
6452 frame.py
7292 generic.py
4649 groupby.py
2208 indexing.py
3 index.py
0 __init__.py
5654 internals.py
700 missing.py
825 nanops.py
1432 ops.py
99 panel4d.py
132 panelnd.py
1633 panel.py
1419 resample.py
3243 series.py
485 sorting.py
1923 strings.py
2089 window.py
47602 total
mrocklin@carbon:~/workspace/scikit-learn/sklearn$ wc -l */*.py
84 _build_utils/__init__.py
46 __check_build/__init__.py
18 __check_build/setup.py
394 cluster/affinity_propagation_.py
508 cluster/bicluster.py
638 cluster/birch.py
317 cluster/dbscan_.py
81 cluster/_feature_agglomeration.py
937 cluster/hierarchical.py
36 cluster/__init__.py
1597 cluster/k_means_.py
422 cluster/mean_shift_.py
54 cluster/setup.py
482 cluster/spectral.py
218 covariance/elliptic_envelope.py
288 covariance/empirical_covariance_.py
709 covariance/graph_lasso_.py
34 covariance/__init__.py
739 covariance/robust_covariance.py
560 covariance/shrunk_covariance_.py
107 cross_decomposition/cca_.py
2 cross_decomposition/__init__.py
881 cross_decomposition/pls_.py
899 datasets/base.py
138 datasets/california_housing.py
123 datasets/covtype.py
102 datasets/__init__.py
380 datasets/kddcup99.py
503 datasets/lfw.py
114 datasets/mlcomp.py
252 datasets/mldata.py
147 datasets/olivetti_faces.py
275 datasets/rcv1.py
1714 datasets/samples_generator.py
22 datasets/setup.py
272 datasets/species_distributions.py
481 datasets/svmlight_format.py
372 datasets/twenty_newsgroups.py
160 decomposition/base.py
1327 decomposition/dict_learning.py
349 decomposition/factor_analysis.py
594 decomposition/fastica_.py
282 decomposition/incremental_pca.py
40 decomposition/__init__.py
309 decomposition/kernel_pca.py
1305 decomposition/nmf.py
818 decomposition/online_lda.py
838 decomposition/pca.py
29 decomposition/setup.py
304 decomposition/sparse_pca.py
229 decomposition/truncated_svd.py
997 ensemble/bagging.py
161 ensemble/base.py
1952 ensemble/forest.py
2063 ensemble/gradient_boosting.py
401 ensemble/iforest.py
35 ensemble/__init__.py
395 ensemble/partial_dependence.py
17 ensemble/setup.py
351 ensemble/voting_classifier.py
1124 ensemble/weight_boosting.py
7 externals/conftest.py
815 externals/funcsigs.py
5 externals/__init__.py
498 externals/_pilutil.py
9 externals/setup.py
577 externals/six.py
365 feature_extraction/dict_vectorizer.py
172 feature_extraction/hashing.py
519 feature_extraction/image.py
13 feature_extraction/__init__.py
19 feature_extraction/setup.py
45 feature_extraction/stop_words.py
1424 feature_extraction/text.py
122 feature_selection/base.py
201 feature_selection/from_model.py
43 feature_selection/__init__.py
450 feature_selection/mutual_info_.py
473 feature_selection/rfe.py
753 feature_selection/univariate_selection.py
82 feature_selection/variance_threshold.py
297 gaussian_process/correlation_models.py
882 gaussian_process/gaussian_process.py
741 gaussian_process/gpc.py
473 gaussian_process/gpr.py
23 gaussian_process/__init__.py
1864 gaussian_process/kernels.py
96 gaussian_process/regression_models.py
571 linear_model/base.py
566 linear_model/bayes.py
2227 linear_model/coordinate_descent.py
285 linear_model/huber.py
86 linear_model/__init__.py
1527 linear_model/least_angle.py
1794 linear_model/logistic.py
881 linear_model/omp.py
422 linear_model/passive_aggressive.py
132 linear_model/perceptron.py
663 linear_model/randomized_l1.py
504 linear_model/ransac.py
1382 linear_model/ridge.py
347 linear_model/sag.py
48 linear_model/setup.py
1363 linear_model/stochastic_gradient.py
389 linear_model/theil_sen.py
12 manifold/__init__.py
221 manifold/isomap.py
713 manifold/locally_linear.py
431 manifold/mds.py
37 manifold/setup.py
539 manifold/spectral_embedding_.py
897 manifold/t_sne.py
124 metrics/base.py
1996 metrics/classification.py
123 metrics/__init__.py
1409 metrics/pairwise.py
828 metrics/ranking.py
575 metrics/regression.py
569 metrics/scorer.py
32 metrics/setup.py
503 mixture/base.py
784 mixture/bayesian_mixture.py
869 mixture/dpgmm.py
749 mixture/gaussian_mixture.py
853 mixture/gmm.py
22 mixture/__init__.py
59 model_selection/__init__.py
1403 model_selection/_search.py
2095 model_selection/_split.py
1346 model_selection/_validation.py
551 neighbors/approximate.py
801 neighbors/base.py
392 neighbors/classification.py
178 neighbors/graph.py
31 neighbors/__init__.py
219 neighbors/kde.py
352 neighbors/lof.py
194 neighbors/nearest_centroid.py
323 neighbors/regression.py
41 neighbors/setup.py
124 neighbors/unsupervised.py
252 neural_network/_base.py
15 neural_network/__init__.py
1328 neural_network/multilayer_perceptron.py
365 neural_network/rbm.py
266 neural_network/_stochastic_optimizers.py
3162 preprocessing/data.py
184 preprocessing/_function_transformer.py
379 preprocessing/imputation.py
68 preprocessing/__init__.py
837 preprocessing/label.py
222 preprocessing/_target.py
10 semi_supervised/__init__.py
528 semi_supervised/label_propagation.py
896 svm/base.py
73 svm/bounds.py
1160 svm/classes.py
28 svm/__init__.py
81 svm/setup.py
0 tests/__init__.py
452 tests/test_base.py
306 tests/test_calibration.py
14 tests/test_check_build.py
160 tests/test_common.py
68 tests/test_config.py
1252 tests/test_cross_validation.py
367 tests/test_discriminant_analysis.py
166 tests/test_docstring_parameters.py
622 tests/test_dummy.py
815 tests/test_grid_search.py
365 tests/test_impute.py
20 tests/test_init.py
463 tests/test_isotonic.py
257 tests/test_kernel_approximation.py
85 tests/test_kernel_ridge.py
312 tests/test_learning_curve.py
140 tests/test_metaestimators.py
753 tests/test_multiclass.py
493 tests/test_multioutput.py
712 tests/test_naive_bayes.py
986 tests/test_pipeline.py
354 tests/test_random_projection.py
477 tree/export.py
13 tree/__init__.py
39 tree/setup.py
1465 tree/tree.py
23 utils/arpack.py
17 utils/bench.py
178 utils/class_weight.py
134 utils/deprecation.py
2158 utils/estimator_checks.py
766 utils/extmath.py
297 utils/fixes.py
84 utils/graph.py
508 utils/__init__.py
282 utils/linear_assignment_.py
207 utils/metaestimators.py
86 utils/mocking.py
448 utils/multiclass.py
204 utils/optimize.py
199 utils/random.py
508 utils/_scipy_sparse_lsqr_backport.py
84 utils/setup.py
470 utils/sparsefuncs.py
25 utils/stats.py
904 utils/testing.py
224 utils/_unittest_backport.py
868 utils/validation.py
103260 total
Huh. I guess I'll follow your advice on splitting off elemwise but keep everything else together. I should invest some time in learning the navigation functions in PyCharm. Apparently, they're great. I can press keyboard shortcuts to navigate to definition, search for symbols; refactoring is great too.
Also, I meant a specific piece of functionality exceeding 250 (or 500) lines is when we should split it off. Or maybe the criteria should be when a main file exceeds 2000 lines, we should split off the biggest offender.
Ah, I see. I misinterpreted your comment as trying to keep file size below 250, which seemed small to me. I have no numeric guidelines on line length. Personally I usually make decisions based more on the logic of the function rather than the number of lines.
coo.py
is getting large and unweildly. I sometimes spend a lot of time scrolling through it or finding stuff in it. I was thinking of breaking it up into smaller chunks for better maintainability.I propose the following:
coo
directory.coo/coo.py
(for theCOO
class)COO
class into methods, one forelemwise
, one forreduction
, one for__getitem__
, and one for other methods.DOK
.I love any proposed changes to this structure.