tsegall / fta

Metadata/data identification Java library. Identifies Semantic Type information (e.g. Gender, Age, Color, Country,...). Extensive country/language support. Extensible via user-defined plugins. Comprehensive Profiling support.
Apache License 2.0

NPE when merging analysis #95

Closed Akamthan closed 2 months ago

Akamthan commented 3 months ago

Description:

We are profiling a CSV file with multiple columns and rows. One of the columns, DateOfBirth, contains dates separated by both hyphens (-) and slashes (/).

When DateOfBirth is profiled, the getStringConverter() method in Facts.java uses the matchTypeInfo variable, which takes its format and typeModifier from the first DateOfBirth value seen. If the first date is 01-01-1999, it picks the hyphen-separated format and modifier and then throws an error on a slash-separated date, and vice versa. We read "DateOfBirth" from the CSV as a String and then pass it to profiling.

Attaching relevant screenshots, code and data for which we are getting this error.

[Screenshots: MicrosoftTeams-image, image (5), image (7)]

profile_e2e_customer_detail 1.csv

Kindly assist us regarding same.

tsegall commented 3 months ago

When I run the command line interface on your provided file, i.e. by doing the following:

cli/build/install/fta/bin/cli --col 12 ~/Downloads/profile_e2e_customer_detail.1.csv

The output is below:

Field 'DateOfBirth' (12) - { "fieldName" : "DateOfBirth", "totalCount" : 6, "sampleCount" : 6, "matchCount" : 4, "nullCount" : 0, "blankCount" : 0, "distinctCount" : 4, "regExp" : "\d{1,2}-\d{2}-\d{4}", "confidence" : 0.6666666666666666, "type" : "LocalDate", "isSemanticType" : false, "typeModifier" : "M-dd-yyyy", "min" : "5-05-1950", "max" : "2-03-2015", "minLength" : 9, "maxLength" : 10, "topK" : [ "2-03-2015", "9-12-1965", "4-04-1964", "5-05-1950" ], "bottomK" : [ "5-05-1950", "4-04-1964", "9-12-1965", "2-03-2015" ], "cardinality" : 4, "outlierCardinality" : 2, "invalidCardinality" : 0, "shapesCardinality" : 3, "percentiles" : [ "05-05-1950", "05-05-1950", "05-05-1950", "05-05-1950", "05-05-1950", "05-05-1950", "05-05-1950", "05-05-1950", "05-05-1950", "05-05-1950", "05-05-1950", "05-05-1950", "05-05-1950", "05-05-1950", "05-05-1950", "05-05-1950", "05-05-1950", "05-05-1950", "05-05-1950", "05-05-1950", "05-05-1950", "05-05-1950", "05-05-1950", "05-05-1950", "05-05-1950", "05-05-1950", "05-05-1950", "05-05-1950", "05-05-1950", "05-05-1950", "05-05-1950", "05-05-1950", "05-05-1950", "05-05-1950", "05-05-1950", "05-05-1950", "05-05-1950", "05-05-1950", "04-04-1964", "04-04-1964", "04-04-1964", "04-04-1964", "04-04-1964", "04-04-1964", "04-04-1964", "04-04-1964", "04-04-1964", "04-04-1964", "04-04-1964", "04-04-1964", "04-04-1964", "04-04-1964", "04-04-1964", "04-04-1964", "04-04-1964", "04-04-1964", "04-04-1964", "04-04-1964", "04-04-1964", "04-04-1964", "04-04-1964", "04-04-1964", "04-04-1964", "09-12-1965", "09-12-1965", "09-12-1965", "09-12-1965", "09-12-1965", "09-12-1965", "09-12-1965", "09-12-1965", "09-12-1965", "09-12-1965", "09-12-1965", "09-12-1965", "09-12-1965", "09-12-1965", "09-12-1965", "09-12-1965", "09-12-1965", "09-12-1965", "09-12-1965", "09-12-1965", "09-12-1965", "09-12-1965", "09-12-1965", "09-12-1965", "09-12-1965", "02-03-2015", "02-03-2015", "02-03-2015", "02-03-2015", "02-03-2015", "02-03-2015", "02-03-2015", "02-03-2015", "02-03-2015", 
"02-03-2015", "02-03-2015", "02-03-2015", "02-03-2015" ], "histogram" : [ 1, 0, 2, 0, 0, 0, 0, 0, 0, 1 ], "leadingWhiteSpace" : false, "trailingWhiteSpace" : false, "multiline" : false, "dateResolutionMode" : "MonthFirst", "keyConfidence" : 0.0, "uniqueness" : 1.0, "detectionLocale" : "en-US", "ftaVersion" : "15.7.4", "structureSignature" : "DUlZk9jEDXwdtGUYbclz1InMJN4=", "dataSignature" : "fowY39yu6Hx5RA/G7p3lyU9ZD4w=" }

Which looks perfect. Obviously you are doing something differently and have probably encountered a genuine bug. But it is not clear from your report how to reproduce the error. Can you reproduce with a small Java program?

Akamthan commented 3 months ago

Yes sure let me run it and update here. Thanks !!

Also @tsegall, from your output I observed that it is accepting only the hyphen (-) dates. It is rejecting the dates with a slash (/), as I can see from the match count as well. Could you please verify?

tsegall commented 3 months ago

As currently designed, FTA attempts to determine a single format for a date field and then uses that format to parse every value in the field. It also attempts to intuit which component is the month, the day, and the year. In your example data the year is easy to find since it has 4 digits. That leaves us to determine whether the format is MM-dd-yyyy or dd-MM-yyyy (and possibly whether it is M-dd-yyyy, i.e. leading zeroes are omitted). The data you sent looks to be a mix of M/dd/yyyy and MM-dd-yyyy, although with only 6 records in your sample it might be M/d/yyyy for the examples with '/' in them, and the 4 valid samples with '-' as separators could just as easily have been dd-MM-yyyy.
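To illustrate why a single chosen format leaves some values unmatched, here is a minimal sketch using only `java.time`. It is not FTA's code and does not use the FTA API; the class and method names are made up for illustration. The four hyphen-separated dates are taken from the CLI output above, while the two slash-separated dates are hypothetical stand-ins, since the actual slash values from the CSV are not shown in this thread.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.List;

public class SingleFormatDemo {

    /** Counts how many samples parse as a LocalDate under one fixed pattern. */
    static long countMatches(List<String> samples, String pattern) {
        DateTimeFormatter formatter = DateTimeFormatter.ofPattern(pattern);
        return samples.stream().filter(s -> parses(s, formatter)).count();
    }

    /** Returns true if the sample parses under the formatter, false otherwise. */
    static boolean parses(String sample, DateTimeFormatter formatter) {
        try {
            LocalDate.parse(sample, formatter);
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // First four values appear in the CLI output above.
        // The last two (slash-separated) are hypothetical stand-ins.
        List<String> samples = List.of(
                "5-05-1950", "4-04-1964", "9-12-1965", "2-03-2015",
                "7/04/1980", "3/15/1992");

        // Try a month-first hyphen pattern, a day-first hyphen pattern,
        // and a month-first slash pattern against the same samples.
        for (String pattern : List.of("M-dd-yyyy", "d-MM-yyyy", "M/dd/yyyy")) {
            System.out.println(pattern + " matches "
                    + countMatches(samples, pattern) + " of " + samples.size());
        }
    }
}
```

With these samples, both hyphen patterns match the same 4 of 6 values, which shows the month-first versus day-first ambiguity, and the slash pattern matches only the remaining 2. No single pattern covers all six values.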

Akamthan commented 3 months ago

If you don't mind, I would like to ask a question about optimization. We are trying to run FTA profiling in Spark on Databricks clusters. With 100k records the profiling runs, but with 1 million records we frequently encounter NullPointerExceptions during merge operations, which go away when we increase executor memory. This happens when we profile 100 columns over 1 million records. I am attaching the logs for clarity (error.txt). Is there anything you can suggest that would help us with this issue?

tsegall commented 3 months ago

@Akamthan What version of FTA are you running? I just want to sync to that version so I can match up the line numbers in the logs.

tsegall commented 3 months ago

@Akamthan Any chance you can enable tracing - see https://github.com/tsegall/fta?tab=readme-ov-file#reporting-issues

Akamthan commented 3 months ago

We are using fta 15.7.3 @tsegall

I will see if I can enable tracing, but any input from your end would be very helpful. The line number reported is 3689, but our local copy of the code has only 3400 lines, which is also confusing us.

tsegall commented 2 months ago

I am not sure which code you are looking at. The issue is (was) at https://github.com/tsegall/fta/blob/main/types/src/main/java/com/cobber/fta/TextAnalyzer.java#L3689. I believe I have addressed the NPE in 15.7.4, so please download it and verify.

Akamthan commented 2 months ago

Hi @tsegall, after moving to 15.7.4 the NPE no longer occurs, but I now see a different error related to the deserialize method in TextAnalyzer. Attaching logs: error1.txt

Akamthan commented 2 months ago

After upgrading Spark from 3.2.1 to 3.3.2 we no longer see this error. We also tried reproducing the date error using your API, but it does not seem to be reproducible that way. However, whenever we run it via our pipeline, the error occurs for certain datasets where the date format is mixed (this happens during merge operations). I am not sure how to proceed in this case.

Thanks a lot for your help !! Also really happy to see a fellow Precisely employee doing such great open-source contributions. Certainly, an inspiration 👍

tsegall commented 2 months ago

OK - I am going to close this issue since the NPE appears to be addressed. Could you please open another issue for the date problem with a full stack trace as text? The image does not show the full stack trace. Do you also have the dataset that is causing the issue, i.e. just the column with the date field, not the entire dataset?

Akamthan commented 2 months ago

Sure, will raise a new one with required details