Open HeuristicLab-Trac-Bot opened 4 years ago
- Added double vectors for Dataset. Extended the type checks for DataAnalysisProblemData.
- Added a small benchmark instance with data containing vectors. Adapted the ArtificialRegressionDataDescriptor to be able to specify non-double values.
Additional thoughts:
- Consider ModifiableDataset and DataPreprocessing.
- Consider adding generic vector capabilities to IDataset that only allows double, string, DateTime.
- Consider changing IList within the Dataset to a covariant alternative (non-generic IReadOnlyList does not exist, however). Currently the type must be exactly IReadOnlyList<double>, otherwise the invariant IList<T> is not a subtype of IList<IList<double>> for instance.
- Each DataAnalysis algorithm should check on its own whether the types of the allowed input variables are compatible. For instance, LR would only allow double values, whereas SymReg also supports string variables (as factor variables) and double-vector variables.
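The last point, letting each algorithm decide which input variable types it accepts, could look roughly like the following. This is a minimal sketch in Python rather than HeuristicLab's C#; the type labels, the `SUPPORTED_TYPES` table, and both functions are hypothetical names made up for illustration and are not part of HeuristicLab.

```python
from datetime import datetime

# Hypothetical labels for the variable types mentioned in the ticket.
DOUBLE, STRING, DATETIME, DOUBLE_VECTOR = "double", "string", "datetime", "double-vector"

def variable_type(values):
    """Classify a column by the type of its first value (illustrative only)."""
    first = values[0]
    if isinstance(first, (list, tuple)):
        return DOUBLE_VECTOR
    if isinstance(first, datetime):
        return DATETIME
    if isinstance(first, str):
        return STRING
    return DOUBLE

# Assumption: each algorithm declares the input types it can handle itself,
# instead of the problem data deciding for all algorithms.
SUPPORTED_TYPES = {
    "LR": {DOUBLE},
    "SymReg": {DOUBLE, STRING, DOUBLE_VECTOR},
}

def incompatible_inputs(algorithm, dataset, allowed_inputs):
    """Return the allowed input variables whose type the algorithm cannot use."""
    supported = SUPPORTED_TYPES[algorithm]
    return [name for name in allowed_inputs
            if variable_type(dataset[name]) not in supported]

dataset = {
    "x1": [1.0, 2.0, 3.0],
    "city": ["Linz", "Wien", "Graz"],
    "signal": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
}
inputs = ["x1", "city", "signal"]

print(incompatible_inputs("LR", dataset, inputs))      # variables LR would reject
print(incompatible_inputs("SymReg", dataset, inputs))  # empty list: all usable
```

With this split, the problem data only stores the columns, and an algorithm can report exactly which variables it cannot handle before it starts.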
r17365 Added explicit vector types to avoid type mismatches when representing vectors as IList, List or IReadOnlyList. Additional thoughts:
- The IDataset interface (and its implementation) now contains a lot of methods due to all the different available types (double, string, DateTime and also vector versions). In the future, this should be unified.
- Whether the types of the input variables are allowed should be decided by the algorithms, rather than the ProblemData.
r17369 Added Vector symbols to TypeCoherentExpressionGrammar & fixes.
r17401 Added parser for new benchmark data but did not commit the data yet (too large)
r17414 Started adding UCI time series regression benchmarks. Adapted parser (extracted format options & added parsing for double vectors).
r17415 Added additional UCI instances for time series regression
r17416 enabled variable impacts for vectorial data (if vectors have the same length)
- (partially) enabled data preprocessing for vectorial data
- use flat zip-files for large benchmarks instead of embedded resources (faster build times)
- added multiple variants of vector benchmark I (vector length constraints)
r17448 Replaced own Vector with MathNet.Numerics Vector.
- Used types are not yet storable.
- I do not like the using DoubleVector = MathNet.Numerics.LinearAlgebra.Vector<double>; directive. Maybe I'll switch to using MathNet.Numerics.LinearAlgebra.Single; and only use Vector as type.
r17449 Added Transformers for Vectors. Added specialized Transformers for double Dense/SparseVectorStorage and a generic mapper for the remaining (serializable) types.
r17452 Improved Persistence for Vectors (removed the generic transformer and used the existing array transformer instead).
r17455 Added a separate interpreter for vectors that reuses the existing symbols instead of creating explicit vector symbols.
- Added full functional grammar for vectors.
- Added sum and mean aggregation for vectors.
r17463 Added type coherent vector grammar to enforce that the root symbol is a scalar.
r17466 Added separate mean symbol instead of reusing the average symbol.
r17467 Added a "final aggregation" option for the vector interpreter in case the result is a vector.
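The interpreter changes in r17455 to r17467 (reused symbols, sum/mean aggregation, final aggregation) can be sketched as follows. This is not the HeuristicLab interpreter: the tuple-based tree encoding, the symbol names, and the functions `broadcast`, `evaluate` and `evaluate_row` are assumptions chosen for illustration only.

```python
from statistics import fmean

def is_vector(value):
    return isinstance(value, list)

def broadcast(op, a, b):
    """Apply a binary op to two scalars, a scalar and a vector, or two equal-length vectors."""
    if is_vector(a) and is_vector(b):
        if len(a) != len(b):
            raise ValueError("incompatible vector lengths")
        return [op(x, y) for x, y in zip(a, b)]
    if is_vector(a):
        return [op(x, b) for x in a]
    if is_vector(b):
        return [op(a, y) for y in b]
    return op(a, b)

# The same arithmetic symbols serve scalars and vectors; only aggregations are vector-specific.
BINARY = {"add": lambda x, y: x + y, "mul": lambda x, y: x * y}
AGGREGATE = {"sum": sum, "mean": fmean}

def evaluate(node, row):
    kind = node[0]
    if kind == "const":
        return node[1]
    if kind == "var":
        return row[node[1]]
    if kind in BINARY:
        return broadcast(BINARY[kind], evaluate(node[1], row), evaluate(node[2], row))
    if kind in AGGREGATE:
        arg = evaluate(node[1], row)
        # Assumption: aggregating a scalar leaves it unchanged.
        return AGGREGATE[kind](arg) if is_vector(arg) else arg
    raise ValueError(f"unknown symbol: {kind}")

def evaluate_row(tree, row, final_aggregation="mean"):
    """Evaluate a tree; if the root yields a vector, reduce it with the final aggregation."""
    result = evaluate(tree, row)
    if is_vector(result):
        result = AGGREGATE[final_aggregation](result)
    return result

row = {"x": 2.0, "v": [1.0, 2.0, 3.0]}
scalar_tree = ("add", ("var", "x"), ("sum", ("var", "v")))  # root is already a scalar
vector_tree = ("mul", ("var", "x"), ("var", "v"))           # root is a vector

print(evaluate_row(scalar_tree, row))
print(evaluate_row(vector_tree, row, final_aggregation="sum"))
```

The final aggregation only matters for trees like `vector_tree`, which a grammar that enforces a scalar root would not produce in the first place.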
r17469 Added TensorFlow.NET library for constant optimization with vectors (as alternative to AutoDiff+Alglib).
The build process for TensorFlow.NET is somewhat tedious for multiple reasons:
- First, the NumSharp dependency for TensorFlow.NET is not strongly named, thus cannot be loaded with HL.
- The native tensorflow.dll does not ship correctly with the Framework edition (on dotnet core it works).
- A newer version of Google.Protobuf is required.
Due to the reasons above, the following steps were taken to import TensorFlow.NET:
- All dependencies for Google.Protobuf are upgraded to 3.11.4. This includes HEAL.Attic, which is manually built and then replaces the binaries in the bin. A manual build of HEAL.Attic is currently required anyway, because the BoxTransformer is still internal in the latest Nuget release but already fixed in the Master branch.
- Since the OR-Tools (for exact optimization) include already built assemblies referencing the old Google.Protobuf version, they are currently excluded. I also removed the HeuristicLab.ProtobufCS-2.4.1 version to avoid any further conflicts. Therefore, external evaluation and some other plugins do not work on this branch.
- Although TensorFlow.NET is strongly named, its dependency NumSharp is not. Simply signing NumSharp did not work, because the reference from TensorFlow.NET expects an unsigned NumSharp assembly. As a solution, there is a standalone project within the Extlibs (TensorFlowNet) that references the Nuget package for TensorFlow.NET and uses ILMerge (also via Nuget package) to create a single assembly, containing both TensorFlow.NET.dll and the NumSharp.dll, and signed with the HL key. The resulting TensorFlow.NET.signed.dll is (file-) referenced within the transport plugin HeuristicLab.TensorFlowNet.
- The native tensorflow.dll is located within a separate nuget redist package. However, this does not work for dotnet framework for some reason. I created a dotnet core project with the redist package referenced, and copied the native x64 dll from there into the HeuristicLab.TensorFlowNet transport plugin as native dll plugin dependency.
As a final note: The whole build process is unstable. Sometimes the resulting TensorFlow.Signed.dll contains some unloadable types. Clearing the bin folder, praying to ILMerge and the build gods usually helps.
r17472 Moved Alglib+AutoDiff constant optimizer in own class and created base class to provide multiple constant-opt implementations.
r17475 Updated HeuristicLab.Algorithms.DataAnalysis plugin and its dependencies to Framework 4.7.2 to avoid conflicting System.ValueTuple locations (mscorlib or nuget).
r17489 Added version with explicit array shapes for explicit broadcasting.
- Switched whole TF-graph to float (Adam optimizer won't work with double).
- Added progress and cancellation support for TF-const opt.
- Added optional logging with console and/or file for later plotting.
r17556 Some corner cases for empty or length-one vectors now return NaN.
r17573 Added first draft for WindowedSymbol.
ToDo:
- Make other aggregation symbols windowed
- Better encoding for Offset and Length parameters
- Continuous interpretation (e.g. weighted sum/mean)
- Mutation is currently not symmetric (due to cast/floor mechanic when calculating the actual indices)
- Create a test function specifically for benchmarking windowed symbols
- Evaluate alternative: explicit "SubVector" symbol?
  - No continuous interpretation
  - Potential issues with incompatible vector lengths
- Adapted existing benchmarks (no mean/sum of vectors with zero-mean).
- Added new benchmark for testing windowed aggregations.
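To make the cast/floor issue from the WindowedSymbol ToDo list concrete, here is a minimal sketch of a windowed mean. It assumes that Offset and Length are stored as fractions of the vector length in [0, 1] and floored to indices; the ticket does not state the actual encoding, and `window_indices` and `windowed_mean` are hypothetical names.

```python
import math

def window_indices(n, offset, length):
    """Map relative offset/length in [0, 1] to a half-open index range [start, end)."""
    start = min(int(math.floor(offset * n)), n - 1)
    count = max(1, int(math.floor(length * n)))  # always keep at least one element
    end = min(n, start + count)
    return start, end

def windowed_mean(vector, offset, length):
    """Mean over the selected window; NaN for an empty vector."""
    if not vector:
        return math.nan
    start, end = window_indices(len(vector), offset, length)
    window = vector[start:end]
    return sum(window) / len(window)

v = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]

print(window_indices(len(v), 0.25, 0.5))  # baseline window
print(window_indices(len(v), 0.29, 0.5))  # small change of the offset, same window
print(window_indices(len(v), 0.35, 0.5))  # larger change, window moves by one index
print(windowed_mean(v, 0.25, 0.5))
```

Because of the floor, many different offset values map to the same window, and an equal-sized change up or down does not necessarily move the window by the same amount. A continuous interpretation (weighted sum/mean) would avoid this.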
r17593 Added a new simplifier that can also simplify vector-specific operators.
- Added simplification rules for sum-symbol and mean-symbol for addition and multiplication
r17596 Added subtraction/division simplification for sum and mean symbols by converting them to sums/products.
- Changed stddev, variance, etc. to population variant
- Added multiplicative simplifications for stdev and variance symbols
- Added additive simplification rules for stdev and variance symbols.
- Extended simplifications of constants to simplification of all scalar-nodes for aggregation symbols.
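The identities that such simplification rules rely on can be checked numerically. The sketch below uses the population variants of variance and standard deviation, as in r17596. The identities are standard; whether the simplifier implements exactly this set is an assumption, and the vector `v` and scalar `c` are arbitrary example values.

```python
import math
from statistics import fmean, pstdev, pvariance

v = [1.0, 2.0, 4.0, 7.0]
c = -3.0          # a scalar node; negative to show the |c| in the stdev rule
n = len(v)

scaled = [c * x for x in v]    # c * v
shifted = [x + c for x in v]   # v + c

# Each entry pairs the unsimplified expression with its simplified form.
checks = {
    "sum(c*v)   = c * sum(v)":     (sum(scaled), c * sum(v)),
    "mean(c*v)  = c * mean(v)":    (fmean(scaled), c * fmean(v)),
    "sum(v+c)   = sum(v) + n*c":   (sum(shifted), sum(v) + n * c),
    "mean(v+c)  = mean(v) + c":    (fmean(shifted), fmean(v) + c),
    "var(c*v)   = c^2 * var(v)":   (pvariance(scaled), c * c * pvariance(v)),
    "stdev(c*v) = |c| * stdev(v)": (pstdev(scaled), abs(c) * pstdev(v)),
    "var(v+c)   = var(v)":         (pvariance(shifted), pvariance(v)),
    "stdev(v+c) = stdev(v)":       (pstdev(shifted), pstdev(v)),
}

for rule, (lhs, rhs) in checks.items():
    print(rule, math.isclose(lhs, rhs))

all_hold = all(math.isclose(lhs, rhs) for lhs, rhs in checks.values())
```

Each rule moves the scalar out of the aggregation, so the scalar can then be merged with neighbouring scalar nodes.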
r17604 Stores the datatype of a tree node (e.g. variable nodes) in the tree itself for the interpreter to derive the datatypes for subtrees. This way, the interpreter (and simplifier) do not need an actual dataset to figure out datatypes for subtrees.
- Extended importer (vectorvariable, vec-aggregations, ...).
- Started adding unit test for vector simplifications.
- Switched vector-simplification unit-test to infix notation to avoid ambiguities between the peek-string "VAR" for variables and the variance function.
- Added additional unit tests for mean, length, stdev and var simplifications.
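The idea behind r17604, storing the datatype in the leaf nodes so that subtree types can be derived without a dataset, can be sketched as follows. The dictionary-based node layout, the symbol names, and `infer_type` are hypothetical and only illustrate the principle.

```python
SCALAR, VECTOR = "scalar", "vector"

# Assumption: these aggregation symbols always reduce their argument to a scalar.
AGGREGATIONS = {"sum", "mean", "stdev", "var", "length"}

def infer_type(node):
    """Derive the result type of a subtree from the datatypes stored in its leaves."""
    symbol = node["symbol"]
    if symbol in ("variable", "constant"):
        return node["datatype"]  # stored in the tree itself, no dataset lookup
    child_types = [infer_type(child) for child in node["children"]]
    if symbol in AGGREGATIONS:
        return SCALAR
    # Arithmetic symbols: one vector operand makes the result a vector.
    return VECTOR if VECTOR in child_types else SCALAR

def variable(name, datatype):
    return {"symbol": "variable", "name": name, "datatype": datatype}

def op(symbol, *children):
    return {"symbol": symbol, "children": list(children)}

x = variable("x", SCALAR)
v = variable("v", VECTOR)

product = op("mul", x, v)                 # scalar * vector
aggregated = op("add", x, op("mean", v))  # scalar + mean(vector)

print(infer_type(product))
print(infer_type(aggregated))
```

A simplifier can use the same lookup to decide whether an operand is a scalar node that may be moved out of an aggregation.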
r17626 Unified simplification rules for vector aggregation functions.
Issue migrated from trac ticket # 3040
component: Problems.DataAnalysis.Symbolic | priority: medium
2019-11-21 13:20:22: @NimZwei created the issue