segmentio / Analytics-CSharp

The hassle-free way to add Segment analytics to your C# written app (.Net/Xamarin/Unity).
MIT License
21 stars 8 forks source link

Events getting lost or not getting flushed #61

Closed al2114 closed 1 year ago

al2114 commented 1 year ago

We are using the analytics-csharp library for our mobile Unity app to track events to Segment. We are observing events not getting tracked correctly for some of our users and wondering whether tracked events that haven't been flushed yet persist or get dropped upon app crash?

wenxi-zeng commented 1 year ago

hey @al2114, thanks for reporting the issue. can you elaborate more on "events not getting tracked correctly"? are you experiencing event lost or event with incorrect data? if it's event lost, how did you observed the lost? through the debugger on segment workspace or other metrics? it'd be great if you could mention which version of analytics-csharp that you're using.

if you're observing events lost because some of them did not show up in the segment debugger, the events are not necessary lost, because segment debugger samples the events to show up, instead of showing all of them.

in regards to your questions, yes, analytics-csharp persist data to disk at a per event basis. events are persist to the disk before it's flushed. so even if the app is crashed, the data is not lost. it is uploaded upon the next app open. it's very unlikely to experience a buffered queue, unless you're building up hundreds of thousands events continuously in a tight loop.

however, if you're using InMemoryStorageProvider, all events are lost when app is crashed, because it's purely in memory.

al2114 commented 1 year ago

Hey @wenxi-zeng, appreciate the quick response! We believe we're seeing events getting lost. We can confirm activity on the app for these subset of users based on other events from our backend and through our logs, but not receiving any of the events from Segment at our destinations (it is working correctly for most our users, only some seem to be impacted here). Only lead we have is from one of the impacted users, they reported experiencing a crash so suspecting if it could be related... However we do also have our flush time set at 30 seconds, and we also see in our logs that app session length definitely is longer than 30 seconds, so strange that no events are coming through.

We are using version 2.2.0 of the library.

wenxi-zeng commented 1 year ago

@al2114 thanks for the context! would you mind me ask which destinations you've seen the problem? and which platform is this issue occurred (android or ios)? last couple things to confirm: are there no delivered events at all for the impacted user after the crash, or just some of the events lost? are the lost events (in destinations) delivered to segment at all? we'll investigate and see if we could replicate the problem.

analytics-csharp is going for GA next week. there will be more debugging tools such as error handler and logger for you to handle event tracking when error occurs. it will also allow you to custom your flush strategy by implement a flush policy, so you will have more control over the stored events.

in the meanwhile, if you don't wanna wait up and suspect that's the storage causing the issue, you could implement your own StorageProvider.

al2114 commented 1 year ago

That's great to hear @wenxi-zeng, hopefully those tools will be helpful for us to triage. We just did a little more digging and it seems like one common thread across some of the customers impacted here are they seem to all be running on devices to do not support latest iOS version (Running on anywhere between iOS 12 - 15)

We also managed to capture one error log from device console logs coming from our Segment library if this means anything:

(SegmentAnalytics): One or more errors occurred. (Collection was modified; enumeration operation may not execute.)
wenxi-zeng commented 1 year ago

@al2114 that's super helpful! will take a look

al2114 commented 1 year ago

Updated the title to reflect the problem more correctly

wenxi-zeng commented 1 year ago

hey @al2114 we have a lead on the issue. it's very likely a json object is modified during enumeration. we're updating the json library to be thread safe and also evaluating other places in the Analytics library that could run into concurrent modification issue. a fix will be shipped next week. thanks again for bring this issue to our attention

al2114 commented 1 year ago

Hi @wenxi-zeng thanks for address the issue with quick fix – do you suspect that concurrency issue can cause all events tracking to fail for the particular user's device? We actually do have one device from our team members that we have been trying to use to investigate the issue and we see the issue persistent on app sessions (even if the app gets killed and relaunched), events on this device do not reach Segment servers at all.

sandolkakos commented 1 year ago

Hey @wenxi-zeng, when do you guys use to merge main into upm branch in order for us to have a new package version to use in our Unity projects?

wenxi-zeng commented 1 year ago

@al2114 it's possible if you're using cloud mode destination. the SDK setup the destination metadata by iterating though the cloud mode destination json object. that could cause the issue. also, if you're using our unity lifecycle example, there is bug that I fixed in the pr could cause the same issue.

@sandolkakos we're having our GA release this week. it'd be available for all platforms at latest this Friday. I'm trying my best to expedite at least the release for the SDK sooner, since this is an outstanding issue. sorry for the inconvenience it may cause.

wenxi-zeng commented 1 year ago

hey @al2114 and @sandolkakos, we have released a 2.3.0-alpha.2 that has the fix. you're welcome to try it out see if it solves the issue, but it's not suggested to use it in production, since this is a pre-release version meant for testing. the official 2.3.0 will be released in 2 days.

sandolkakos commented 1 year ago

Thanks a lot, @wenxi-zeng. I'm gonna trigger a new internal build using 2.3.0-alpha.2 version and ask our team member to test the build to see if we receive logs from her device.

al2114 commented 1 year ago

Hi @wenxi-zeng, we integrated the new alpha build and tested on impacted device however are still not seeing events showing up on Segement server end. We also do not see any exceptions show up here. I know you mentioned additional debugging tooling, is this in the alpha build and any advice on how we may might be able to add additional logging or debug further?

wenxi-zeng commented 1 year ago

hey @al2114, the logger and error handler are in the alpha version. the official 2.3.0 is also out. you can give it a try. please follow the example here to setup Analytics.Logger and IAnalyticsHandler. you can see more samples here

wenxi-zeng commented 1 year ago

in this line, please check if the exception is AggregateException. if so, you'd have to go through the inner exception to output error message and stacktrace (see here), since AggregateException's stacktrace is always null

sandolkakos commented 1 year ago

Thanks for the hints @wenxi-zeng, we are gonna give them a try and will come back with results.

wenxi-zeng commented 1 year ago

hey @sandolkakos and @al2114, please let us know how did you reproduce this issue. we want to investigate it on our end too. it'd be very helpful if you could provide console logs.

if you've registered to Analytics.Logger, look for this log:

Analytics starting = 

this will tell if the event is missing because of fetching settings or failure of flushing.

sandolkakos commented 1 year ago

hi @wenxi-zeng, we unfortunately have only one person in our team that can 100% reproduce that problem, but we did not understand yet which condition is causing it (she is not a developer and also does not live close to us to make some debugging on her device). So everytime we have a build, we ask here to use our App and check if Analytics are coming.

We have now applied the changes according to the samples you have suggested and, as soon as we have test results, we come back to let you know.

Thanks.

sandolkakos commented 1 year ago

hey @wenxi-zeng, we have tested version 2.3.0 but unfortunatelly the events data are not coming yet. Here are the logs we have collected (with censored user id, email and phoneNumber). All the lines containing SegmentLogger: are new logs added on version 2.3.0: REDACTED

wenxi-zeng commented 1 year ago

hey @sandolkakos, thanks a lot for posting the logs! it's very helpful on spotting the issue

from the log, looks like everything runs as expected. the following indicates that analytics is up and running:

2023-09-27 07:30:07.693 | [36] (SegmentAnalytics): SegmentLogger: Message: Analytics starting = True 

and below indicates the events being persisted to disk:

2023-09-27 07:30:07.727 | [37] (SegmentAnalytics): SegmentLogger: Message: Segment.io running  ************

and then below indicates the events are successfully uploaded to the server with no problem. as you can find in this line, this log is only outputted if upload is completed.

2023-09-27 07:30:36.835 | [1517] (SegmentAnalytics): SegmentLogger: Message: Segment.io uploaded ****

so let's try following:

  1. make sure if you are using ConcurrentList if you followed our sample code to build a singleton.
  2. double check on segment's workplace debugger, see if you received the events above the uploaded log (i.e. ReadingScreen, Game Started, etc.)
  3. Switch to InMemoryStorageProvider like this, see if the issue still persists. if so, we rule out the possibility the issue is caused by storage
  4. Switch to a different HttpClientProvider (if you use the default one, switch to UnityHttpClientProvider, and verse-vise), see if the issue still persists. if so, we rule out the possibility the issue is caused by the sdk. if you switch to UnityHttpClientProvider, you can add more logs to see if the request went through.

thanks for bearing with us on this.

wenxi-zeng commented 1 year ago

@sandolkakos I just noticed that all the events has:

"library": {"version": "2.2.0","name": "Analytics-CSharp"} 

double check if you're using 2.3.0. that device might use a cache version of the app. I'd suggest completely uninstall the app before re-installing

sandolkakos commented 1 year ago

Sorry for the delay to reply you here @wenxi-zeng, last week was super rushed. But I was finally able to find one of our devices where I could reproduce this issue and will continue the tests. The device is a iPad Mini 2 - iOS 12.5.7.


Replying your questions separately:

@sandolkakos I just noticed that all the events has:

"library": {"version": "2.2.0","name": "Analytics-CSharp"} 

double check if you're using 2.3.0. that device might use a cache version of the app. I'd suggest completely uninstall the app before re-installing

That was my fault, cause I did not notice the context is initialized with hard-coded library values. But I can confirm we are using the correct package version: 2.3.0.


  1. make sure if you are using ConcurrentList if you followed our sample code to build a singleton.

That is correct. I've just adapted our code to follow your sample code.


  1. double check on segment's workplace debugger, see if you received the events above the uploaded log (i.e. ReadingScreen, Game Started, etc.)

Using same build number and same Wi-Fi connection, I can see Segment Debugger receiving events from devices:

But not from device:


  1. Switch to InMemoryStorageProvider like this, see if the issue still persists. if so, we rule out the possibility the issue is caused by storage

I'm gonna test this one 👍🏽


  1. Switch to a different HttpClientProvider (if you use the default one, switch to UnityHttpClientProvider, and verse-vise), see if the issue still persists. if so, we rule out the possibility the issue is caused by the sdk. if you switch to UnityHttpClientProvider, you can add more logs to see if the request went through.

I'm gonna also test this one 👍🏽

sandolkakos commented 1 year ago

Hey @wenxi-zeng, I've implemented my own HttpClientProvider and here you can see the logs. The Post request is getting a response with StatusCode = 200, meaning Segment receives the data and give us an Okay.

public sealed class ElloHttpClient : HTTPClient
{
    private readonly ILogger _logger;

    public ElloHttpClient(string apiKey, string apiHost, string cdnHost, ILogger logger)
        : base(apiKey, apiHost, cdnHost)
    {
        _logger = logger;
    }

    public override async Task<Response> DoGet(string url)
    {
        _logger.Log($"ElloHttpClient.DoGet - Url: {url}");

        UnityWebRequest request = UnityWebRequest.Get(url);
        await request.SendWebRequest();

        _logger.Log($"ElloHttpClient.DoGet - StatusCode: {request.responseCode}");
        _logger.Log($"ElloHttpClient.DoGet - Response: {request.downloadHandler.text}");

        return new Response { StatusCode = (int)request.responseCode, Content = request.downloadHandler.text };
    }

    public override async Task<Response> DoPost(string url, byte[] data)
    {
        _logger.Log($"ElloHttpClient.DoPost - Url: {url}");
        _logger.Log($"ElloHttpClient.DoPost - Data: {string.Join(",", data)}");

        string postData = Encoding.UTF8.GetString(data);
        _logger.Log($"ElloHttpClient.DoPost - PostData: {postData}");

        UnityWebRequest request = UnityWebRequest.Post(url, postData, "application/json");
        await request.SendWebRequest();

        _logger.Log($"ElloHttpClient.DoPost - StatusCode: {request.responseCode}");
        _logger.Log($"ElloHttpClient.DoPost - Response: {request.downloadHandler.text}");

        return new Response { StatusCode = (int)request.responseCode, Content = request.downloadHandler.text };
    }
}

Logs: REDACTED

wenxi-zeng commented 1 year ago

thanks @sandolkakos! were you able to see the events in segment debugger and in the destination with your own HttpClientProvider?

wenxi-zeng commented 1 year ago

hey @sandolkakos I believe I found the issue. the batch file is malformed because of an interrupted event write, which causes uploading to be failed. you can verify this by formatting the json string in ElloHttpClient.DoPost - PostData. the first event in the batch only has half the content.

it's actually more complicated than I thought. each event in the batch file has no ending bracket. but in Segment.io running log, the event is serialized properly. it might be that outdated ios version doing something different on write.

can you also replace the storage provider with InMemoryStorageProvider, see if the in memory batch file is still malformed?

al2114 commented 1 year ago

Curious, we're receiving 200 status with success: true, would malformed data provide such a response from the request response?

sandolkakos commented 1 year ago

thanks @sandolkakos! were you able to see the events in segment debugger and in the destination with your own HttpClientProvider?

No, it was still the same. No events in Segment Debugger and neither in MixPanel.

sandolkakos commented 1 year ago

hey @sandolkakos ~I believe I found the issue. the batch file is malformed because of an interrupted event write, which causes uploading to be failed. you can verify this by formatting the json string in ElloHttpClient.DoPost - PostData. the first event in the batch only has half the content.~

it's actually more complicated than I thought. each event in the batch file has no ending bracket. but in Segment.io running log, the event is serialized properly. it might be that outdated ios version doing something different on write.

can you also replace the storage provider with InMemoryStorageProvider, see if the in memory batch file is still malformed?

Wow, just tested the build I've triggered with the InMemoryStorageProvider and the events is getting displayed on Segment Debugger and also in MixPanel.

logs:

REDACTED

Print Screen:

image

wenxi-zeng commented 1 year ago

Curious, we're receiving 200 status with success: true, would malformed data provide such a response from the request response?

yes. this is by design on segment's backend. it returns 200 as long as the call succeeds regardless the content. I know this is very counter-intuitive and confusing. but it's out of the SDK's control

wenxi-zeng commented 1 year ago

@sandolkakos that's awesome! now we can finally confirm the root cause is the MS FileStream.Seek API behaves differently on ios 12. it does not start from position end + 1, but from end. that's why all the events are missing ending brackets.

I'll get an alpha version out today to fix this issue. please bear with me, we are almost there. thank you so much for being so collaborative and patient with us.

FYI, you have full control over the storage by writing your own storage provider. so if you wanna explore it in the meanwhile, try it out. :)

sandolkakos commented 1 year ago

@sandolkakos that's awesome! now we can finally confirm the root cause is the MS FileStream.Seek API behaves differently on ios 12. it does not start from position end + 1, but from end. that's why all the events are missing ending brackets.

I'll get an alpha version out today to fix this issue. please bear with me, we are almost there. thank you so much for being so collaborative and patient with us.

FYI, you have full control over the storage by writing your own storage provider. so if you wanna explore it in the meanwhile, try it out. :)

We are gonna wait for the fix. Thanks to you as well for always giving us directions here.

wenxi-zeng commented 1 year ago

hey @sandolkakos after further investigation, we found out the root cause actually is due to non-ascii characters presented in the payload, causing incorrect length of measure when writing data to file. please try out 2.3.1-alpha.2, it now supports unicode characters. let us know if it fixes the issue. we'll get the official 2.3.1 out once it's confirmed.

in case you're interested, the result of

Console.WriteLine("Length of \"Marllon’s iPad Mini 2\": " + "Marllon’s iPad Mini 2".Length);
Console.WriteLine("Bytes of \"Marllon’s iPad Mini 2\": " + "Marllon’s iPad Mini 2".GetBytes().Length);

turned out to be:

Length of "Marllon’s iPad Mini 2": 21
Bytes of "Marllon’s iPad Mini 2": 23
sandolkakos commented 1 year ago

Hi @wenxi-zeng, I can confirm that special character is the cause of this issue. I've removed that special character, restarted my device to kill any cached device name. Once I opened our App, the events got received in Segment and MixPanel. 🙌🏽 Note: On iPad Air 5 - iOS 16.6.1 this issue does not happen.

I'm gonna test your new package version and will let you know the results.

wenxi-zeng commented 1 year ago

🙌 great to hear that! yes. os behaves differently. some os I tested replace unicode character to it's \u counterpart in serialization, which works fine. some os pass it as is, which causes problem. I also reproduced this issue on iphone 14 + ios 16.6.1, if I track with unicode char, no event are delivered. but with 2.3.1-alpha.2, there is no problem

al2114 commented 1 year ago

We also realized the reason why we haven't been seeing this prevalently on our users running iOS 16+ is because iOS 16 introduced a privacy feature that defaults the device name without this entitlement:

https://developer.apple.com/documentation/bundleresources/entitlements/com_apple_developer_device-information_user-assigned-device-name

Thanks for helping get to the root of this @wenxi-zeng

sandolkakos commented 1 year ago

I can confirm the version 2.3.1-alpha.2 fixes the issue: image

Thanks a lot, @wenxi-zeng !

wenxi-zeng commented 1 year ago

@al2114 that makes so much sense! I was doing an explicit analytics.Track("Marllon’s iPad Mini") in ios 16 and reproduced this issue.

@sandolkakos awesome!!!! I'm gonna get the official 2.3.1 out.

thank you all for helping me to get this far!

wenxi-zeng commented 1 year ago

2.3.1 is now available in openupm. closing this issue out.

thank you @al2114 and @sandolkakos again! it's been a pleasure!