ngs-doo / dsl-json

High performance JVM JSON library
https://dsl-platform.com
BSD 3-Clause "New" or "Revised" License
1.01k stars 106 forks source link

removing sensitive data from json form #277

Open mkeskells opened 2 months ago

mkeskells commented 2 months ago

Its a bit similar to the problem in https://github.com/ngs-doo/dsl-json/issues/266

I have two requirements to not expose sensitive data

  1. To hide some values entirely (e.g. a the value of a password field). This would be driven by the name of the field (but still retain the structure)
  2. To alter some values from text fields, e.g. the value contains some secret

Ideally we would do (2) based on data at runtime

I could imagine that we could do (1) by some annotation that runs at compile time (but I am new to this library), and currently this is also expected to be a runtime setting. Maybe it could be both

For (2) I would imagine that the easiest solution would be subclassing JsonWriter (but this is a final class).

Happy to work on this with someone if that helps.

My first thought is that if we could make JsonWriter non final, I could do (2) And then expose the field that we are writing (as a complile switch?) then this could do (1) as well, but I could cope without this probably

What are you thoughts

zapov commented 2 months ago

There are 2 serialization modes: all and "minimal" But minimal could/should be generalized to conditional one, as currently it has a hardcoded condition of value being a "default one". Ideally there could be custom expression of when the condition is met.

So what I would look into is introducing new CONDITIONAL ObjectFormatPolicy which would also require some evaluator on property itself when it would write out the value and when it would not. This could be some link to "static" method which would accept the json, writer and instance (plus probably method name for generalization).

Then gen code for "conditional/minimal" serialization would inject this hardcoded methods for evaluation instead. I'm assuming we would want to allow this custom method per property which would be used over object defined one when defined. Plus a way to make it always serialize without going through the code and checking it.

mkeskells commented 2 months ago

Hi here is a first stab at it

https://github.com/ngs-doo/dsl-json/pull/278

it would need more tests

Not sure that I have understood the relationship between the ObjectFormatPolicy and the setting in the DSL but this does I think work for me, and allows finer control

Happy to discuss

The are a few extra methods added to the JsonWriter, but the bulk of the change generates code like below

        private final com.dslplatform.json.PropertyAccessor<com.dslplatform.json.CombinedFormatTest.ImmutableComposite1> propertyAccessor = (instance, field) -> {
            switch (field.getName()) {
                case "x":
                    return instance.x;
                case "s":
                    return instance.s;
                case "d":
                    return instance.d;
                default:
                     throw new IllegalArgumentException("Unknown field " + field.getName());            }
        };

        private final java.util.List<com.dslplatform.json.PropertyInfo<com.dslplatform.json.CombinedFormatTest.ImmutableComposite1>> propertyInfos = java.util.List.of(
            new com.dslplatform.json.PropertyInfo("x", quoted_x),
            new com.dslplatform.json.PropertyInfo("s", quoted_s),
            new com.dslplatform.json.PropertyInfo("d", quoted_d));

        public void writeContentControlled(final com.dslplatform.json.JsonWriter originalWriter, final com.dslplatform.json.CombinedFormatTest.ImmutableComposite1 instance) {
            final com.dslplatform.json.JsonWriter.FilterInfo filterInfo = originalWriter.controlledFilterInfo(instance, com.dslplatform.json.CombinedFormatTest.ImmutableComposite1.class, propertyAccessor, propertyInfos);
            com.dslplatform.json.JsonWriter writer = originalWriter;
            for (com.dslplatform.json.PropertyInfo property : originalWriter.controlledStart(instance, com.dslplatform.json.CombinedFormatTest.ImmutableComposite1.class, propertyAccessor, propertyInfos, filterInfo)) {
                writer = originalWriter.controlledPrepareForProperty(instance, com.dslplatform.json.CombinedFormatTest.ImmutableComposite1.class, propertyAccessor, filterInfo, property, writer);
                if (writer != null) {
                    switch (property.getName()) {
                        case "x":
                            if (instance.x == null) writer.writeNull();
                            else com.dslplatform.json.NumberConverter.serialize(instance.x, writer);
                            break;
                        case "s":
                            if (instance.s == null) writer.writeNull();
                            else writer.serialize(instance.s, writer_s);
                            break;
                        case "d":
                            if (instance.d == null) writer.writeNull();
                            else com.dslplatform.json.CombinedFormatTest.DoubleConverter.JSON_WRITER().write(writer, instance.d);
                            break;
                        default:
                             throw new IllegalArgumentException("Unknown property " + property.getName());
                    }
                }
            }
            originalWriter.controlledFinished(instance, com.dslplatform.json.CombinedFormatTest.ImmutableComposite1.class, propertyAccessor, filterInfo, writer);
        }
zapov commented 2 months ago

I'll try to find some more time to review it this or next week, but on a first glance here are some comments:

All in all, you seem to be on a good track, but try to stick to this two rules: no allocation allowed by default and sealed and final should stay as such

mkeskells commented 2 months ago

Hi thanks for the comments. I don't see how we can accomplish this without subclassing JsonWriter. The writer is embedded in so much of the library, and generated code, there isn't a simple way to intercept and vet the data written that I can see. And that's without considering custom serialisers

I havent profiled it, but I don't think that there are allocations outside of the static initialiser. I saw a couple of fields (that should have been static) and have made them so. That was the intention anyway

I was trying to leave to current path unaffected to retain performance. Injecting some property into the JsonWriter to validate/tweak any update is slowing that path

mkeskells commented 2 months ago

the changes are incomplete and here https://github.com/ngs-doo/dsl-json/pull/278/commits/93e4f149d362912c7ceb8b8fa25b1fef10b2fc8d

adding this functionality without subclassing JsonWriter, and having a policy to allow filtering is more invasive, and the code is only vetting the key values, not any writes to the values. Doing the writes to the values would slow down the write path, and I would have to disambiguate the writes to be vetted from the writes that are 'safe'

the generated code is now a static block for all of the class data ( which would be used in any option I think

        private final static com.dslplatform.json.ClassInfo<com.dslplatform.json.CombinedFormatTest.ImmutableComposite1> classInfo;
        static {
            final com.dslplatform.json.PropertyAccessor<com.dslplatform.json.CombinedFormatTest.ImmutableComposite1> propertyAccessor = new com.dslplatform.json.PropertyAccessor<com.dslplatform.json.CombinedFormatTest.ImmutableComposite1>(com.dslplatform.json.CombinedFormatTest.ImmutableComposite1.class) {
                public Object getField(com.dslplatform.json.CombinedFormatTest.ImmutableComposite1 instance, com.dslplatform.json.PropertyInfo propertyInfo) {
                    switch (propertyInfo.getName()) {
                        case "x":
                            return instance.x;
                        case "s":
                            return instance.s;
                        case "d":
                            return instance.d;
                        default:
                            throw new IllegalArgumentException("Unknown field " + propertyInfo.getName());                  }
                }
            };

            java.util.List<com.dslplatform.json.PropertyInfo> propertyInfos;
            java.util.ArrayList<com.dslplatform.json.PropertyInfo> _propertyInfos = new java.util.ArrayList<com.dslplatform.json.PropertyInfo>(3);
            _propertyInfos.add(new com.dslplatform.json.PropertyInfo("x", new com.dslplatform.json.JsonAttributeInfo("x", 1, new String[] {}, true, com.dslplatform.json.JsonAttribute.IncludePolicy.NON_DEFAULT)));
            _propertyInfos.add(new com.dslplatform.json.PropertyInfo("s", new com.dslplatform.json.JsonAttributeInfo("s", 2, new String[] {}, true, com.dslplatform.json.JsonAttribute.IncludePolicy.NON_DEFAULT)));
            _propertyInfos.add(new com.dslplatform.json.PropertyInfo("d", new com.dslplatform.json.JsonAttributeInfo("d", 3, new String[] {}, false, com.dslplatform.json.JsonAttribute.IncludePolicy.NON_DEFAULT)));
            propertyInfos = java.util.Collections.unmodifiableList(_propertyInfos);

            classInfo = new com.dslplatform.json.ClassInfo<com.dslplatform.json.CombinedFormatTest.ImmutableComposite1>(com.dslplatform.json.CombinedFormatTest.ImmutableComposite1.class, com.dslplatform.json.CompiledJson.ObjectFormatPolicy.DEFAULT, propertyInfos, propertyAccessor);
        };

and the generated controlled writer looks like

        @Override public <CONTROL$INFO extends com.dslplatform.json.ControlInfo> boolean writeContentControlled(final com.dslplatform.json.JsonWriter writer, final com.dslplatform.json.CombinedFormatTest.ImmutableComposite1 instance, com.dslplatform.json.JsonControls<CONTROL$INFO> controls) {
            final CONTROL$INFO controlInfo = controls.controlledInfo(instance, classInfo);
            boolean hasWritten = false;
            for (com.dslplatform.json.PropertyInfo propertyInfo: controls.controlledProperties(instance, classInfo, controlInfo)) {
                switch (propertyInfo.getName()) {
                    case "x":
                    {
                        int[] value = instance.x;
                        switch (controls.shouldWrite(instance, classInfo, controlInfo, propertyInfo, writer, value, true, value != null)) {
                            case WRITE_NORMALLY:
                                writer.writeAscii(propertyInfo.getQuotedNameAndColon());
                                long positionBefore = writer.size();
                                if (value == null) writer.writeNull();
                                else com.dslplatform.json.NumberConverter.serialize(value, writer);
                                controls.afterPropertyWrite(instance, classInfo, controlInfo, propertyInfo, writer, positionBefore);
                                // expected fallthrough
                            case WRITTEN_DIRECTLY:
                                writer.writeByte((byte)','); hasWritten = true;
                                // expected fallthrough
                            case IGNORED:
                                break;
                        }
                        break;
                    }
                    case "s":
                    {
                        java.util.List value = instance.s;
                        switch (controls.shouldWrite(instance, classInfo, controlInfo, propertyInfo, writer, value, true, value != null)) {
                            case WRITE_NORMALLY:
                                writer.writeAscii(propertyInfo.getQuotedNameAndColon());
                                long positionBefore = writer.size();
                                if (value == null) writer.writeNull();
                                else writer.serialize(value, writer_s);
                                controls.afterPropertyWrite(instance, classInfo, controlInfo, propertyInfo, writer, positionBefore);
                                // expected fallthrough
                            case WRITTEN_DIRECTLY:
                                writer.writeByte((byte)','); hasWritten = true;
                                // expected fallthrough
                            case IGNORED:
                                break;
                        }
                        break;
                    }
                    case "d":
                    {
                        java.lang.Double value = instance.d;
                        switch (controls.shouldWrite(instance, classInfo, controlInfo, propertyInfo, writer, value, true, value != null)) {
                            case WRITE_NORMALLY:
                                writer.writeAscii(propertyInfo.getQuotedNameAndColon());
                                long positionBefore = writer.size();
                                if (value == null) writer.writeNull();
                                else com.dslplatform.json.CombinedFormatTest.DoubleConverter.JSON_WRITER().write(writer, value);
                                controls.afterPropertyWrite(instance, classInfo, controlInfo, propertyInfo, writer, positionBefore);
                                // expected fallthrough
                            case WRITTEN_DIRECTLY:
                                writer.writeByte((byte)','); hasWritten = true;
                                // expected fallthrough
                            case IGNORED:
                                break;
                        }
                        break;
                    }
                }
            }
            return hasWritten;
        }

I would appreciate comments on if this is the way that you want to proceed. I have left much of the other implementation in there still (I could not see a reason to remove it yet)

mkeskells commented 1 month ago

@zapov do you have any comments?

zapov commented 1 month ago

Hi, was a bit busy, but it should be better now.

So... I'm not a fan of introducing various structures into dsl-json which are meant for this logic. People have all kind of use cases and they should be managing the logic via their own data structures. So those ClassInfo and similar is not something I want to have in the library.

But... I think we should go a bit slower (not only because the PR was too large - would much prefer 2 PRs: one for rename and other for logic)... so lets try to cover the basics first.

Can you provide some usage examples how people would configure that and use it within dsl-json (though pseudo-code, not via PR). When we agree on that we can move forward. Because for me... this should work similarly to current minimal serialization logic (in a sense that some method should be called to decide if property should be serialized or not). So for this, similarly how there is this converters and NamingStrategy... I would prefer external classes are referenced which can then be used in generated code and will take care of logic for serialization.

So, first of all.. is the goal of this to have serialization which is:

or something in between?

So for me... currently you have this logic in generated code

class Instance { public int x; }

void serialize(Instance instance) {
  if (instance.x != 0) { // which is default value
    //serialize x
  }
}

I would prefer implementation which mostly does in this case something along the lines of

void serialize(Instance instance) {
  if (ExternalClassWithVisibilityRules.shouldSerialize(instance, "x", dsl)) { // which is default value
    //serialize x
  }
}

I'm even ok if there is an option to have that in class (eg when external serializer is not defined) - which is aligned with naming strategy

interface JsonCanSerialize {
      boolean canSerialize(String propertyName, DslJson dslJson);
}

void serialize(Instance instance) {
  if (instance.canSerialize("x", dsl)) {
    //serialize x
  }
}

In the end... what seems to me that it makes sense is that you are able to define this rule on the property itself in a similar manner (but we can leave that for later)

mkeskells commented 1 month ago

Hi Good to have you back!

I only introduced the ClassInfo to make my life easier, and to avoid having to pass lots of parameters in lots of places, while we are agreeing the interface to change. I am not wedded to that, but we don't want to have something that has 5 parameters in lots of places

So the requirements that I have as that the redaction/filtering of the data must be at run time (we can do some compile time filtering as well), but we have to be able to determine that we have to hide some data that is leaking and remediate that, without having to re-release code.

The data configuring what to look for and how to interact with replacing that is regex expressions, and some constant strings

We have full control of the DSL and the environment that is used for this writing, and we don't need to make these setting affect all json serialisation on that JVM. Its does have to be thread safe, as used in logging code

There are 2 types of data that we want to filter

  1. data where the field name indicates that the value is something that should be be redacted, e.g. password, cardNumber, security.*, pin etc
  2. values that contain data that looks to be something that should not be kept e.g. BEGIN.*PRIVATE KEY-----.*END.*PRIVATE KEY-----

We generally want to leave a marker to say that the data was redacted, rather then just hiding it, and downstream processes then kick in and we can do the compile time filtering potentially, in the slower cycle on the next release

We need to maintain the json structure as valid in the output, clearly

So in some cases we want to ignore a key/value entirely, and in some cases we want to write a different value, with some/all of the text changed, so its harder than just canSerialise/shouldSerialise in your examples above, that's why I had the enum return saying what to write, and to control if the value was written already and you need to add a ','

Because you don't want to subclass the JsonWriter I need to retain information about what is checked and what isn't, to cope with serialising a map as a value, when I need to check that values written, or when there is a custom serialiser etc, ie when a value being written is more that a single string, number etc, or I need to intercept those writes as well, which seems hard to do. I still think that subclassing JsonWriter would be simpler. You could still seal it so that the only subclasses are one with no controls and one with delegated controls, and some associated, pluggable visibility rules driving that

You asked to have this implementation leveraging the minimum logic, which meant that the logic need to determine if the value is default (null, o etc) without boxing, which causes some bloating of the interface

The reordering the fields that I built isn't a requirement, but it some of came for free with the other infra that I built, but we can drop that

Agreed that it would be good to have this on a property level, but that is a custom writer (as I see it) and its not the pressing requirement that I have, its an improvement (for me)

to use something like

void serialize(Instance instance) {
  if (ExternalClassWithVisibilityRules.shouldSerialize(instance, "x", dsl)) { // which is default value
    //serialize x
  }
}

we would have to read instance.x to check if it was something to be handled specially etc, and we can't customise the value written. Are you proposing that the code generated is just checking if the default value is serialised, not actually if the field with a non default value is serialised?

zapov commented 1 month ago

For me those are 2 separate problems. The feature of masking output so you don't leak secrets you can already do in multiple ways. Eg ideally you would have special types for passwords and you can assign type converter on that type and mask them on output. There is an example here how to do it for time type: https://github.com/ngs-doo/dsl-json/blob/master/examples/MavenJava/src/main/java/com/dslplatform/maven/Example.java#L184 You can even put it on something more generic (like a String)... but its always better to have more specific types for such cases. Alternatively you can assign converters on explicit properties (which makes sense for passwords). So I would say, masking output value is covered with existing features.

What dsl-json does not have ATM is this logic when you don't want to put anything out. Thats why I was suggesting to only cover that case. Technically if we extend the API to pass in writer, you can write something out and then tell dsl-json not to serialize the regular property. I'm kind of ok with it because I think it would work out of the box, although I think it would be better not to allow it for now (as there might come some future need to prevent it).

mkeskells commented 1 month ago

I think that we are talking at cross purposes

For specific field types - I am aware that we can use custom convertors to mask out passwords, and other special types that can be identified as part of the schema definitions, but his is a run-time catch, that serves the purposes of catching missed configuration, and when we need to apply additional rules and cannot just shut the system down, wait for a rebuild, and new release etc and then continue. The current facilities dont allow this to be addressed fast enough for our needs, we need to be able to change the rules on demand

For masking the values - There is no facility within the library to control the output values (on all fields) which is again what we need. This applies to any filed written that looks to contain data that we don't want to be in the json form, not just for some known fields. This also has the requirement that is needs to be immediate. We would need to sign a custom writer for every field of every object, and for any keys and value of maps etc, and then after that there are some edge case with existing custom writers to consider, and collections etc

You are right that there are 2 separate requirements, but you solution, if I understand it only addresses one of then, and requires a rebuild, which makes it unappropraite for our needs

zapov commented 1 month ago

So what that sounds to me is like we can certainly improve the signature of DslJson's tryFindReader, tryFindWriter and tryFindBinder to pass in more information there (location of the object, eg its class and property name) so you can handle this rules whenever something is trying to be written. This way you can instead return a writer which will mask the value, while knowing in which object and for which property this is, to be able to apply the rules - do not leak this property of this object.

The signature I suggested for deciding if we want to write out the field at all would certainly let you do the same (as you have the instance - so can take the instance class, its passing the property name and you have the DslJson instance which can contain this rules (as you can provide your own DslJson instance).

mkeskells commented 1 month ago

So - what are you expecting the generated code to look like (roughly)

zapov commented 1 month ago

Are you asking about tryFindReader improvement? I would focus on that for now and we can later address the remaining issue.

I would add

public JsonReader.ReadObject<?> tryFindReader(Type manifest, String attribute, Class<?> location)

deprecate the old methods and redirect them to this one. Then generated code should just call them to provide this relevant attributes, eg instead of

this.reader_property = __dsljson.tryFindReader(manifest_property);

have something like

this.reader_property = __dsljson.tryFindReader(manifest_property, "property", com.dslplatform.json.models.UnknownTypeWithConverter.Generic.class);

btw. you didn't say does this address your need in full (at least around this masking needs)

mkeskells commented 1 month ago

I presume you meant __dsljson.tryFindWriter Are you proposing that for every field that is written, we call __dsljson.tryFindWriter(...) and that determines what is written?

this would allow us to control the writing, and allow us to whitelist some types (e.g. for values that are numbers and booleans)

It would mean that we end up duplicating the writing code for array, maps etc, as we would need to validate the data prior to writing it, rather than interception the write. I don't think that there is a lot of code here though

It would mean that it would not work with a customed writer as I see it. E.g. one that writes something not generated by the plugin, but hand crafted

It would also mean that we could not omit a field, or change the property name written

Looking in writeProperty, there seem to be lots of paths that don't use the writer_<fieldName>, e.g. where there is a convertor, its a JsonObject, there is a target.convertor, or an optimisedConvertor

Have I understood what you are proposing? Just want to confirm I have understood, and that we are talking about the same thing, before I spend much effort of thinking & coding

zapov commented 1 month ago

Yes, while tryFindWriter is relevant for your use case, I think this pattern applies to all 3.

I'm mostly trying to confirm that extending the tryFind API in such a way does resolve your use case (your second point in requirements) It does not affect converters, but I'm ok if there is an option to not use optimized converters, or even allow declaration of your own conversion to override optimized converters. Explicit converters I think are different kind of problem as you should be able to have your own rules around them and eg either disallow them, or make sure they go through the approved converters only. I would same JsonObject is the same, as in... if you want to prevent its usage, you can implement some rules around compilation to not allow it, or only whitelist certain ones.

As for the second part, I've looked over the API and it gets a bit more complicated with managing boolean hasWritten if we allow passing the writer around, so I would avoid that for now. This does prevent you from writing some other name instead, but seems simpler to reason and good enough.

Anyway... I would support this by having CompiledJson annotation with

Class visibilityCheck() default CompiledJson.class;

which would point to class such as ExternalClassWithVisibilityRules and expect it to have shouldSerialize method with relevant signature (method with class compatible signature, property name and dsl instance) with boolean signature for result.

Generated code would look something like

if (ExternalClassWithVisibilityRules.shouldSerialize(instance, "x", dsl)) { // returns boolean
  //serialize x
}

The other question is should we just leave MINIMAL as a policy or introduce a new one eg: CONDITIONAL as mostly the main consequence is that since this would only affect writeContentMinimal (and lets leave the name as is for now - we can change it later in another PR) what kind of code should be embedded inside this //serialize x part If we leave only MINIMAL I would assume there could be expectations that value is always serialized when it passes this outer check (which is probably ok) and it would mean that when visibilityCheck is defined, instead of checking for default value in there, it should just write value out. If there is a need to retain previous MINIMAL behavior even after this external checks, then it does make sense to introduce additional one: CONDITIONAL. But maybe for simplicity sake we could start with just when visibilityCheck is defined always just write it out.

Unrelated, but by looking at the code it seems to me that there is a bug in this serialized code and there is a missing check to ensure size of the buffer. Eg in front of

writer.writeByte((byte)','); hasWritten = true;

there should be

writer.ensureCapacity(2);

just to make sure that when flushed to stream we can still go back one place due to conditional ending.