rakam-io / rakam-api

📈 Collect customer event data from your apps. (Note that this project only includes the API collector, not the visualization platform)
https://rakam.io
GNU Affero General Public License v3.0

Why do string event properties get cropped after 100 characters? #111

Closed: iamakulov closed this issue 5 years ago

iamakulov commented 5 years ago

Recently, we found out that Rakam silently crops long string event properties to 100 characters. For example, if you do something like this:

rakam.logEvent('eventName', {
  nodeDataset: JSON.stringify(someHtmlNode.dataset),
});

and JSON.stringify(someHtmlNode.dataset) turns out to be longer than 100 characters, only the first 100 characters are saved to the database.

We were saving relatively large JSON-encoded objects in one of the event fields, and because of this cropping we lost an important part of the data.

It looks like this is the code that does the cropping:

https://github.com/rakam-io/rakam/blob/656c168d78aeb3058388df22ee88f21e391eadfd/rakam-postgresql/src/main/java/org/rakam/postgresql/analysis/PostgresqlEventStore.java#L295-L297
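
To make the effect of the cropping concrete, here is a minimal sketch of what happens to a JSON-encoded property once only its first 100 characters survive. The names `ASSUMED_LIMIT`, `cropLikeServer` and `findOversizedProperties` are hypothetical and are not part of Rakam or its client library; `cropLikeServer` only imitates the behavior described above and is not the server's actual code.

```javascript
// Assumption: string properties are cut to their first 100 characters,
// as observed in this issue.
const ASSUMED_LIMIT = 100;

// Hypothetical stand-in for the server-side cropping (not Rakam's code).
function cropLikeServer(value, limit = ASSUMED_LIMIT) {
  return value.length > limit ? value.substring(0, limit) : value;
}

// Hypothetical client-side helper: lists the string properties of an event
// that are longer than the limit and would therefore be cropped.
function findOversizedProperties(properties, limit = ASSUMED_LIMIT) {
  return Object.entries(properties)
    .filter(([, value]) => typeof value === 'string' && value.length > limit)
    .map(([key, value]) => ({ key, length: value.length }));
}

// A dataset-like object whose JSON encoding is longer than 100 characters.
const dataset = { id: 'x'.repeat(80), role: 'y'.repeat(80) };
const encoded = JSON.stringify(dataset);
const stored = cropLikeServer(encoded);

// The cropped value is cut off in the middle of the second field,
// so it is no longer valid JSON.
let parsedOk = true;
try {
  JSON.parse(stored);
} catch (e) {
  parsedOk = false;
}

console.log(stored.length); // 100
console.log(parsedOk); // false
console.log(findOversizedProperties({ nodeDataset: encoded, short: 'ok' }));
```

A check like `findOversizedProperties` could run before calling rakam.logEvent so that the client at least logs a warning instead of losing data silently.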

iamakulov commented 5 years ago

For reference, I found two more places where cropping happens:

https://github.com/rakam-io/rakam/blob/656c168d78aeb3058388df22ee88f21e391eadfd/rakam/src/main/java/org/rakam/collection/JsonEventDeserializer.java#L496-L498

https://github.com/rakam-io/rakam/blob/656c168d78aeb3058388df22ee88f21e391eadfd/rakam/src/main/java/org/rakam/collection/CsvEventDeserializer.java#L174-L176

In the case of JsonEventDeserializer.java (the first snippet), the maxStringLength property is configurable – you can change it by adding something like

collection.max-string-length = 999999

to Rakam’s config.properties. However, strings are still cropped when they are saved to the database (see the code snippet in the previous message), so the config doesn’t help.
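
The interaction between the two limits can be modeled with a short sketch. This is a simplified, hypothetical model of the behavior reported in this comment, not Rakam's actual code: `crop` and `simulatePipeline` are made-up names, and both the deserializer default of 100 and the fixed store limit of 100 are assumptions based on what was observed in this issue.

```javascript
// Hypothetical model; none of these names come from the Rakam codebase.
function crop(value, maxLength) {
  return value.length > maxLength ? value.substring(0, maxLength) : value;
}

// Stage 1 models the deserializer, whose limit follows collection.max-string-length.
// Stage 2 models the event store, which is assumed to crop to a fixed length
// regardless of the config.
function simulatePipeline(value, { deserializerMax, storeMax }) {
  const afterDeserializer = crop(value, deserializerMax);
  return crop(afterDeserializer, storeMax);
}

const input = 'a'.repeat(250);

// Assumed default: both stages crop to 100 characters.
const withDefault = simulatePipeline(input, { deserializerMax: 100, storeMax: 100 });

// Raising only the deserializer limit: the store stage still crops.
const withRaisedConfig = simulatePipeline(input, { deserializerMax: 999999, storeMax: 100 });

console.log(withDefault.length, withRaisedConfig.length); // 100 100
```

In this model the effective limit is the smaller of the two values, which is why raising the config setting alone changes nothing.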

buremba commented 5 years ago

@iamakulov we crop string values that exceed the expected length because the data collected by Rakam is used for analytical purposes. Our customers usually don't store big string blobs; instead, they break the string values down and send them as separate attributes, as is done with User-Agent values.

As you already figured out, the value is configurable in config.properties, and we intentionally made it configurable on the server side. The idea is that the data is collected from users and is not reliable in that sense. Therefore, we try to sanitize user input as much as possible in order to provide a reliable system.

iamakulov commented 5 years ago

Got it, thanks!

iamakulov commented 5 years ago

> and send them as new attributes such as User-Agent values.

BTW, just in case it’s relevant: this is the approach I took initially, but at some point I hit the limit of 200 custom fields per collection, so I had to start encoding data into larger strings.

iamakulov commented 5 years ago

For anyone affected: in the end, I solved the cropping issue by removing the .substring() branches in JsonEventDeserializer.java and PostgresqlEventStore.java in my own fork. Now Rakam saves strings of arbitrary length.

If you use rakam-cookbook to deploy Rakam (this is the case with Rakam’s AWS CloudFormation template), you can switch it to your fork by searching the source code for buremba/rakam and replacing the GitHub links you find with links to your own repo.