namebrandon / Sparkov_Data_Generation

Synthetic Credit Card Transaction Generator used in the Sparkov program.
MIT License

questions about the profiles and fraud rate #3

Closed. streamnsight closed this issue 2 years ago.

streamnsight commented 2 years ago

Very interesting tool. Good job there.

I have many questions about this tool though:

It indeed generates a 'realistic' dataset, but it's very unbalanced. It might be useful to be able to define the rate of fraud so as to obtain a balanced dataset (rather than generating a huge set and later downsampling 90%+ of it). Having that option would be useful, I think.

Thanks

namebrandon commented 2 years ago

Thanks!

Profiles are best guesses... They're intended to create distinct segments that can be detected in the data. I have no idea if rural females between 25-50 are more likely to shop on Tuesday than Monday. :)

I believe your second point is accurate.

Part of working with fraud in a realistic environment is dealing with, and training models on, an unbalanced data set, which is why it's set up like that. The code should be easy to modify to support a variety of needs.

That being said, this is something I put together over 6 years ago for a grad school project, and it is definitely unmaintained. I'm happy to review and approve pull requests, though, if you'd like to submit any!
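
For readers who want a more balanced set without changing the generator, the downsampling mentioned in the question can be sketched roughly as below. This is only an illustration and not part of the repository: the function name `downsample_to_fraud_rate`, the `target_rate` parameter, and the `is_fraud` label key are all assumptions, so adjust them to whatever your generated files actually contain.

```python
import random


def downsample_to_fraud_rate(rows, target_rate, label_key="is_fraud", seed=0):
    """Keep every fraud row and randomly drop legitimate rows so that fraud
    makes up roughly `target_rate` of the result.

    `rows` is a list of dicts; `label_key` is an assumed column name.
    """
    if not 0 < target_rate <= 1:
        raise ValueError("target_rate must be in (0, 1]")

    fraud = [r for r in rows if r[label_key] == 1]
    legit = [r for r in rows if r[label_key] != 1]

    # Solve fraud / (fraud + n_legit) == target_rate for n_legit.
    n_legit = round(len(fraud) * (1 - target_rate) / target_rate)
    # Cannot keep more legitimate rows than exist in the input.
    n_legit = min(n_legit, len(legit))

    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sample = fraud + rng.sample(legit, n_legit)
    rng.shuffle(sample)
    return sample


# Hypothetical usage: 10 fraud rows among 1000, rebalanced to 50% fraud.
rows = [{"is_fraud": 1}] * 10 + [{"is_fraud": 0}] * 990
balanced = downsample_to_fraud_rate(rows, target_rate=0.5)
```

Keeping all fraud rows and sampling only the legitimate ones preserves the rare class. The trade-off, as noted above, is that a model trained on the result no longer sees the realistic class imbalance.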