ramblingjordan / AbBOT-python

MIT License
25 stars 15 forks

Improve random "data" generation #12

Open santaanna2021 opened 3 years ago

santaanna2021 commented 3 years ago

Populate doctor first name / last name with random draws of first names and last names from Texas Medical Association members (to ensure they are plausible without pointing at a specific person). Populate location info with any non-PO Box zip code from Texas. Switch up formats for state name and doctor name reporting (TX, Texas, etc. | First Last, Dr. Last, etc.).

Misterguruman commented 3 years ago

This is similar to the changes that were just merged, but with JSON files instead of CSV.

andria-dev commented 3 years ago

@Misterguruman This PR has some other improvements that would be useful. Currently, we're always using the format of "Dr. MaleFirstName LastName" but this would add variety to that and other fields like the State field.

santaanna2021 commented 3 years ago

@Misterguruman @andria-dev I'd strongly suggest implementing some of the changes that randomize doctor name and state to make our form inputs more difficult to detect. If someone figures out that the bot always fills a male doctor with the same name format (Dr. First Last), it's relatively easy to filter those entries out for additional scrutiny. Similarly, if the bot always enters "Texas" in the state field, those entries could be filtered out for additional scrutiny. The idea here is to provide a random assortment of realistic permutations that a user might enter into an uncontrolled text box. For example, by asking for "state" instead of providing a dropdown menu of states or postal abbreviations, they leave it open to the user to decide whether to enter "Texas" or "TX" or some variant of those depending on their feelings about capitalization. Similarly, by asking for a doctor "name" rather than FirstName and LastName, they leave it open to the user to decide whether to enter "Dr. First Last", "First Last", "Dr. Last", "Dr. F. Last", "F. Last", etc. By randomly selecting one of those formats, rather than always using the same format, we can make false entries difficult to filter out. Additionally, by using doctor surnames that are publicly available through https://www.texmed.org/, we can make it harder to use lists of actual doctors to filter out fake data.

andria-dev commented 3 years ago

I can work on porting this PR over to using the JSON files the other PR provided if you want. There's also another PR for real TX doctor surnames that would be super useful with this, @santaanna2021.

santaanna2021 commented 3 years ago

@andria-dev that would be awesome. Alternatively, if time's not too critical, I'm happy to take a stab at it later this weekend when I've got some extra time. Just to be clear, the goal would be to draw the location info and first name / last name from the existing .json files rather than the .csv files, but maintain the format scrambling that I've got baked in now, right? I'll check back in tomorrow and take a stab at it if you haven't yet.

andria-dev commented 3 years ago

This might actually be more useful in the AbBOT-api. I haven't checked out that repo much, but I'll see if this can help over there.