Large dataset deserialization

phatcher commented 5 years ago

@object I don't think there's much more I can do to improve the performance of ToObject, as it now has a low overhead e.g. in the latest performance profile ToObject cost approx 33 µs per record with a total time spent in Simple.OData.Client of 1.4s vs 5.8s for Microsoft (largely Edm).

So the predominant cost is OData parsing from the raw json, which we can see from the benchmark results where using the library versus using Newtonsoft.Json is about 10x more expensive.

You posted a fragment which allows access to the HttpResponseMessage e.g.

 var clientRequest = await client
                .For<Transfers>()
                .Filter(x => some filter)
                .BuildRequestFor()
                .FindEntriesAsync();
var clientResponse = await clientRequest.RunAsync();
var responseMessage = clientResponse.ResponseMessage

I have one use case which is retrieving a lot of records (~22k) and the difference is dramatic 4.5s vs 8s and doing the usual profiler run you can see that it's all spent in parsing...

SoC deserializer

NewtonSoft Json

I've taken your code and generalized it slightly introducing a new interface IODataDeserializer...

public interface IODataDeserializer
{
    Task<IEnumerable<T>> FindEntriesAsync<T>(HttpResponseMessage message, CancellationToken cancellationToken);

    Task<T> FindEntryAsync<T>(HttpResponseMessage message, CancellationToken cancellationToken);
}

Then we can have a default implementation as follows...

public async Task<IEnumerable<T>> FindEntriesAsync<T>(HttpResponseMessage message, CancellationToken cancellationToken)
{
    var json = await message.Content.ReadAsStringAsync();

    var result = JsonConvert.DeserializeObject<ODataResponse<T>>(json);

    return result.Value;
}

public async Task<T> FindEntryAsync<T>(HttpResponseMessage message, CancellationToken cancellationToken)
{
    var result = await FindEntriesAsync<T>(message, cancellationToken);

    return result.FirstOrDefault();
}

And some extension methods

public static class ODataExtensions
{
    public static async Task<IEnumerable<T>> FindDeserializeEntriesAsync<T>(this IBoundClient<T> client, IODataDeserializer deserializer = null, CancellationToken cancellationToken = default(CancellationToken))
        where T : class
    {
        var query = await client.BuildRequestFor()
                .FindEntriesAsync(cancellationToken);

    return await deserializer.FindEntriesAsync<T, T>(query, cancellationToken);
    }

    public static async Task<IEnumerable<TDestination>> FindDeserializeEntriesAsync<T, TDestination>(this IBoundClient<T> client, IODataDeserializer deserializer = null, CancellationToken cancellationToken = default(CancellationToken))
    where T : class
    where TDestination : class
     {
    var query = await client.BuildRequestFor()
                .FindEntriesAsync(cancellationToken);

    return await deserializer.FindEntriesAsync<TDestination, T>(query, cancellationToken);
    }

    public static async Task<IEnumerable<T>> FindDeserializeEntriesAsync<T>(this IBoundClient<IDictionary<string, object>> client, IODataDeserializer deserializer = null, CancellationToken cancellationToken = default(CancellationToken))
    where T : class
    {
    var query = await client.BuildRequestFor()
            .FindEntriesAsync(cancellationToken);

    return await deserializer.FindEntriesAsync<T, IDictionary<string, object>>(query, cancellationToken);
    }

    private static async Task<IEnumerable<T>> FindEntriesAsync<T, TSource>(this IODataDeserializer deserializer, IClientWithRequest<TSource> client, CancellationToken cancellationToken)
    {
    var response = await client.RunAsync(cancellationToken);
    var message = response.ResponseMessage;

    deserializer = deserializer ?? new JsonODataDeserializer();

    return await deserializer.FindEntriesAsync<T>(message, cancellationToken);
    }
}

The naming is so that it doesn't clash with the existing method on IBoundClient etc. Using it then becomes very easy..

var query = Client.For<Measurement>()
          .Filter(x => x.AssetId == asset.Id)
          .OrderByDescending(x => x.MeasurementDate);

var result = await query.FindDeserializeEntriesAsync();

One benefit of this method is that the FindDeserializeEntitiesAsync can take a different type that the source entity.

Couple of points open for discussion...

Naming of the extension method
Whether we provide a default implementation in the library or leave it up to consumers, avoids taking a dependency on Newtonsoft.Json.
Unifying with existing FindEntriesAsync methods marking some as obsolete prior to v6.0

I don't think this becomes the default behaviour since you need the complex OData parsing for some domain models, but there are definite use cases where a simpler deserialization is useful.

object commented 5 years ago

Hi Paul, and thank you for new idea about how to improve serialization performance. I agree we need to find a way to customize serialization to improve the speed offered by ODataLib converters, but while the easiest way is to offer extension methods for IBoundClient, I am not sure it's the best one. My problem with this approach is that IBoundClient already has a huge number of methods, just FindEntriesAsync has 8 (!) overloads, and most of other methods have also mutlipe overloads. Adding just a couple of new extension methods makes such extension inconsistent with the rest of the interface. And what about overloads for UpdateEntriesAsync, DeleteEntriesAsync? What about OData action and function counterparts?

I believe we need to explore two possible directions (there may be more):

Enable plugging in a custom serializer so IBoundClient doesn't need to be extended, it will just use the chosen serializer.
If this becomes hard or impossible using current design, we should focus on enabling do what you suggest but outside the main library, either as an add-on NuGet package or as an extension point for developers who can use it and add whatever extension methods they need. Then consistency of these methods won't be an issue.

These are just some rough thoughts, I haven't investigated if they fit current code base.

sixlettervariables commented 5 years ago

I'm in favor of Option 1. Beyond cases where NewtonSoft.Json or Manatee.JSON could be used, it would also enable plugging in the newer UTF-8 optimized JSON Reader from Microsoft.

object commented 5 years ago

Me too. Now it's just to figure out how to do it :-)

phatcher commented 5 years ago

One thing I notice on IBoundClient is that we typically have two overloads for with and without CancellationToken support which makes extending the API painful e.g.

/// <summary>
/// Retrieves an entry by executing OData GET request.
/// </summary>
/// <returns>The first of the entries found.</returns>
Task<T> FindEntryAsync();

/// <summary>
/// Retrieves an entry by executing OData GET request.
/// </summary>
/// <param name="cancellationToken">The cancellation token.</param>
/// <returns>The first of the entries found.</returns>
Task<T> FindEntryAsync(CancellationToken cancellationToken);

whereas we could achieve the same by making cancellationToken an optional parameter with a default e.g.

/// <summary>
/// Retrieves an entry by executing OData GET request.
/// </summary>
/// <param name="cancellationToken">The cancellation token.</param>
/// <returns>The first of the entries found.</returns>
Task<T> FindEntryAsync(CancellationToken cancellationToken = default(CancellationToken));

Now whilst that is technically a breaking change to the API, I don't think any client would notice since we are not strongly named. This works as the cancellationToken is the last parameter on all the API calls and this design seems to be how most Microsoft APIs introduce CancellationToken support. I did a quick check and all the tests pass with no code change.

More generally can we look at the API as a whole and work out which bits should be marked as Obsolete so we can retire them in v6

object commented 5 years ago

Yes, this is a possibility, but it will require major version change because replacing two method calls with one makes the library binary incompatible with older clients, i.e. every client that used a version without cancellation token will break unless recompiled.

phatcher commented 5 years ago

Ah, I'd forgotten about the binary compatibility - but who updates to a new version of a library without recompiling?

sixlettervariables commented 5 years ago

@phatcher we do in-place upgrades if we've already shipped the main product and are addressing a third party bug with a point release in the third party software. (side note: our main installations are all air gapped, so not easy to do updates of any size)

object commented 5 years ago

@phatcher we must respect versioning rules. And it's cheaper to bump a version than to break somebody's code.

phatcher commented 5 years ago

@sixlettervariables Ah ok, I've come across that sort of installation issue before but forgotten about it.

@object I've got a branch on my fork which tidies up the interface, do we want to create a v6 branch to do the experimental API stuff on?

object commented 5 years ago

@phatcher yes I think it would be useful. But maybe do it first not as v6 but new branch for each new feature (interface revision can be a separate feature branch). Then it will be easier to decide which of them go into v6.

phatcher commented 5 years ago

@object So the changes to simplify the interfaces for CancellationToken are on 568_CancellationToken, all the tests pass so it's just a matter of whether you want to do it.

object commented 5 years ago

They look good and indeed simplify maintenance of the library a lot. I guess once you finalize the other branch (with type converters) we can release 6.0 based on them (perhaps also some fixes).

object commented 5 years ago

And of course custom serialization would be another great reason for major version bump. Just need to figure it how to implement it without changing fluent interface methods.

raviraj1976 commented 5 years ago

Hi Paul,

Why can't we use ReadAsStreamAsync from HttpResponseMessage instead of ReadAsStringAsync to benefit from the performance as suggested in https://www.newtonsoft.com/json/help/html/Performance.htm) . We can also pass JsonSerializer settings as parameter which will be used during deserialization. We can write FindEntriesAsync method like below. We can use ContractResolver with JsonSerializer to trim() the spaces if any from the OData response during deserialization.

Example

var jsonSerializer = new JsonSerializer { NullValueHandling = NullValueHandling.Ignore, MissingMemberHandling = MissingMemberHandling.Ignore, ContractResolver = new TrimConverterResolver() };

public async Task<IEnumerable> FindEntriesAsync(HttpResponseMessage message, JsonSerializer jsonSerializer, CancellationToken cancellationToken) { JsonSerializer jsonSerializerSettings;

// Default JsonSerializer if the JsonSerializer parameter is null if (jsonSerializer == null) { jsonSerializerSettings = new JsonSerializer(); }

using (var stream = await httpResponseMessage.Content.ReadAsStreamAsync().ConfigureAwait(false))
{
    using (var streamReader = new StreamReader(stream))
    {
        using (var jsonReader = new JsonTextReader(streamReader))
        {               

            return jsonSerializer.Deserialize<ODataResponse<T>>(jsonReader).Value;
        }
    }
}

} using System; using Newtonsoft.Json.Serialization;

/// <summary>
/// TrimConverterResolver implements IContractResolver
/// </summary>
public class TrimConverterResolver : DefaultContractResolver

{ ///

/// CreateContract ///

/// objectType> /// JsonContract protected override JsonContract CreateContract(Type objectType) { JsonContract contract = base.CreateContract(objectType);

    // this will only be called once and then cached
    if (objectType == typeof(string))
    {
        contract.Converter = new TrimmingConverter();
    }

    return contract;
}

}

using System; using Newtonsoft.Json;

/// <summary>
/// TrimmingConverter
/// </summary>
public class TrimmingConverter : JsonConverter

{ /// public override bool CanRead => true;

/// <inheritdoc/>
public override bool CanWrite => false;

/// <inheritdoc/>
public override bool CanConvert(Type objectType)
{
    return objectType == typeof(string);
}

/// <inheritdoc/>
public override object ReadJson(JsonReader reader, Type objectType, object existingValue, JsonSerializer serializer)
{
    return Convert.ToString(reader.Value)?.Trim();
}

/// <inheritdoc/>
public override void WriteJson(JsonWriter writer, object value, JsonSerializer serializer)
{
    throw new NotImplementedException();
}

}

object commented 5 years ago

@phatcher I've been thinking more about your suggestion and while I definitely see the use of custom converters that would be faster than the one offered by ODataLib, I am not sure this can be achieved in a general way. Even today those who want to customise handling of response messages can grab them after slight rewrite of code:

            var request = await _client
                .For<Person>()
                .BuildRequestFor()
                .FindEntriesAsync();
            var response = await request.RunAsync();

The code above makes accessible HTTP response message, so after that developer is on his own in how this message is handled. It can be converted to string or JSON and then deserialised the best suitable way. Can this be done smoother, so people won't even need call BuildRequestFor and RunAsync and add custom deserialization behind the scenes? I am afraid it will be quite hard without losing edge cases and additional features like support for batches and annotations.

phatcher commented 5 years ago

@object Thanks for the analysis, I think you are correct and custom deserialization is out of scope, but we might need to expand some of the interfaces to allow for slightly more control.

The current issue I have is how to do custom deserialization whilst still allowing for pagination via the annotations.NextLink behaviour. For example, I have an extension in my library to retrieve pages of records/all records which takes into account the server-side pagination i.e. say the client wants a page size of 50 but the server only delivers 20 per request, we need to follow some page links...

/// <summary>
/// Get all entries for a query up to a maximum quantity
/// </summary>
/// <typeparam name="T"></typeparam>
/// <param name="client">Client to use</param>
/// <param name="maxEntries">Maximum entries to retrieve</param>
/// <param name="cancellationToken"></param>
/// <returns></returns>
public static async Task<DataPage<T>> FindAllEntriesAsync<T>(this IBoundClient<T> client, int maxEntries = 1000, CancellationToken cancellationToken = default(CancellationToken))
    where T : class
{
    try
    {
        var items = await client.ToDataPage(1, maxEntries, cancellationToken).ConfigureAwait(false);
        return items;
    }
    // catch (Exception ex)
    catch (Exception)
    {
        throw;
    }
}

/// <summary>
/// Get a page of data with a specific page size
/// </summary>
/// <typeparam name="T"></typeparam>
/// <param name="client">Client to use</param>
/// <param name="pageNumber">Page number to use</param>
/// <param name="pageSize">Page size to use</param>
/// <param name="cancellationToken"></param>
/// <returns></returns>
public static async Task<DataPage<T>> ToDataPage<T>(this IBoundClient<T> client, int pageNumber, int pageSize, CancellationToken cancellationToken = default(CancellationToken))
    where T : class
{
    if (pageSize <= 0 || pageNumber <= 0)
    {
        return null;
    }

    // Total Item to show per page
    var items = new List<T>();

    // Data to skip
    var itemsToSkip = pageSize * Math.Max(pageNumber - 1, 0);
    var annotations = new ODataFeedAnnotations();

    // Get the first page
    var entities = await client.Skip(itemsToSkip)
                        .Top(pageSize)
                    .FindEntriesAsync(annotations, cancellationToken)
                    .ConfigureAwait(false);
    items.AddRange(entities.ToList());

    // Server may have chunked it
    while (annotations.NextPageLink != null)
    {
        entities = await client.FindEntriesAsync(annotations.NextPageLink, annotations, cancellationToken).ConfigureAwait(false);
        items.AddRange(entities.ToList());
    }

    // Give it back.
    var page = new DataPage<T>
    {
        Data = items,
        PageNumber = pageNumber,
        PageSize = pageSize,
        Count = annotations.Count.HasValue ? (int)annotations.Count : 0
    };

    return page;
}

The problem is if I try to translate this to something with custom deserialization I can get the first page with

var query = await client.Skip(itemsToSkip)
                                   .Top(pageSize)
                                   .BuildRequestFor()
                                   .FindEntriesAsync(annotations, cancellationToken)

var response = await query.RunAsync(cancellationToken);    
                                   .ConfigureAwait(false);

var entities = await deserializer.FindEntriesAsync<T>(response.Message, cancellationToken);

where deserializer is the custom deserialization class, but then there is no overload of IRequestBuilder{T} which takes an Uri so that I can call it with annotations.NextPageLink.

BTW I don't mind contributing the code above along with the DataPage class, if you think it appropriate as it's a very common issue when you start consuming APIs as you can't necessarily control the server pagination behaviour.

phatcher commented 5 years ago

@raviraj1976 Yes using the streaming is a good idea; I tried it on my project and it's approximately 30% faster than the string deserialization.

The other thing fairly obvious thing I found, though it's worth stating, is to only select the properties you need; I saved about 25% by not returning audit columns and columns I had already filtered on, this reduces the returned payload size but also cuts down the amount of parsing of you have to do. The final result was a reduction from 7.5s to 3.3s for 10K records.

raviraj1976 commented 5 years ago

Agree. Having a select operator to get only the required columns will reduce the payload. Another option is to include gzip compression (AutomaticDecompression = DecompressionMethods.Deflate | DecompressionMethods.GZip) on the OData client. Reference http://jonathanpeppers.com/Blog/improving-http-performance-in-xamarin-applications

On Mon, Jan 14, 2019 at 5:27 AM Paul Hatcher notifications@github.com wrote:

@raviraj1976 https://github.com/raviraj1976 Yes using the streaming is a good idea; I tried it on my project and it's approximately 30% faster than the string deserialization.

The other thing fairly obvious thing I found, though it's worth stating, is to only select the properties you need; I saved about 25% by not returning audit columns and columns I had already filtered on, this reduces the returned payload size but also cuts down the amount of parsing of you have to do. The final result was a reduction from 7.5s to 3.3s for 10K records.

— You are receiving this because you were mentioned.

Reply to this email directly, view it on GitHub https://github.com/object/Simple.OData.Client/issues/568#issuecomment-453958728, or mute the thread https://github.com/notifications/unsubscribe-auth/ARInIlpcULdCpFCluowYginy8j9O_UqOks5vDFuCgaJpZM4ZajZW .

phatcher commented 5 years ago

@raviraj1976 I thought that was on by default, one other thought was to see what effect using BSON serialization has since it's supposed to be quicker to parse.

simple-odata-client / Simple.OData.Client

Large dataset deserialization #568