vasantivmahajan / DataWrangling_AnomalyDetection_Tableau_Visualization_EDGAR_Log_Data


Data cleaning and anomaly detection approaches #1

Open vasantivmahajan opened 7 years ago

vasantivmahajan commented 7 years ago
  1. IP address: Replace with 255.255.255.255 (or some default value)
  2. Date: Copy the date from the previous row
  3. Time: Take the previous row's time + 1
  4. Zone: Max of zone (??)
  5. CIK: Get the CIK from Accession number if available, if not replace with NaN/None
  6. Accession number: Get the Accession number from CIK if available, if not replace with NaN/None
  7. Add a CIK_Accesion_Check_Flag: Check whether the CIK and the accession number match; a mismatch is treated as an outlier
  8. Add Company name by fetching the information from CIK_Company name
  9. Extension: If missing, fill with accession_number.txt
  10. Create a new column file_name: If the extension is just .txt, fetch the accession number and form the name as accession_number.txt. If it is index.html or anything else, make no changes.
  11. Code: If size exists, set 200 as the default value; otherwise set NaN
  12. IDX: Check with extension to fetch the details
  13. Norefer: If find != 0, its value should be 0.0; otherwise it is 1.0. Check that the value is binary; if it is not, treat it as an anomaly.
  14. Noagent: Set default value as 0.0 (Considering there is an agent)
  15. Find: Opposite of norefer
  16. Check that if find is 0.0, norefer is 1.0.
  17. Crawler: ???
  18. Browser: Replace with N/A
  19. Get company name from CIK
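
Some of the rules above (items 1, 2, 3, 11, 13, 14, 16 and 18) could be sketched in plain Python roughly as follows. This is only an illustration under stated assumptions, not the code used in this repository: the column names (`ip`, `date`, `time`, `size`, `code`, `noagent`, `browser`), the helpers `fill_missing` and `norefer_is_anomalous`, the reading of "previous one + 1" as one second later, and the use of `None` in place of NaN are all hypothetical choices.

```python
from datetime import datetime, timedelta

DEFAULT_IP = "255.255.255.255"  # default value from item 1


def is_missing(value):
    """Treat None, empty strings and NaN as missing."""
    return value is None or value == "" or value != value  # NaN != NaN


def fill_missing(rows):
    """Apply fill rules 1, 2, 3, 11, 14 and 18 to a list of dict rows.

    Rows are processed in order, so "previous" means the row just
    before the current one. All column names are assumptions.
    """
    cleaned = []
    prev = None
    for row in rows:
        row = dict(row)  # work on a copy, leave the input untouched
        if is_missing(row.get("ip")):
            row["ip"] = DEFAULT_IP
        if prev is not None:
            if is_missing(row.get("date")) and not is_missing(prev.get("date")):
                row["date"] = prev["date"]
            if is_missing(row.get("time")) and not is_missing(prev.get("time")):
                # Assumption: "+ 1" means one second after the previous time.
                t = datetime.strptime(prev["time"], "%H:%M:%S")
                row["time"] = (t + timedelta(seconds=1)).strftime("%H:%M:%S")
        if is_missing(row.get("code")):
            # 200 when a size was logged, otherwise leave the code empty.
            row["code"] = None if is_missing(row.get("size")) else 200
        if is_missing(row.get("noagent")):
            row["noagent"] = 0.0
        if is_missing(row.get("browser")):
            row["browser"] = "N/A"
        cleaned.append(row)
        prev = row
    return cleaned


def norefer_is_anomalous(find, norefer):
    """Items 13 and 16: norefer must be binary and consistent with find."""
    if norefer not in (0.0, 1.0):
        return True
    expected = 0.0 if find != 0 else 1.0
    return norefer != expected
```

Rows with a missing date or time at the very start of the file are left as they are, since there is no previous row to copy from.
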
vasantivmahajan commented 7 years ago

Accession number format, e.g. 0001193125-15-118890:

  * 0001193125 --> CIK
  * 15 --> Year
  * 118890 --> sequential count of submitted files from that CIK
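
The layout described above can be split apart with a small regular expression. This is a minimal sketch; the helper name `parse_accession` and the choice to return integers (which drops the leading zeros of the CIK part) are assumptions made for illustration.

```python
import re

# Layout from the comment above: 10 digits, 2 digits, 6 digits, dash-separated.
ACCESSION_RE = re.compile(r"^(\d{10})-(\d{2})-(\d{6})$")


def parse_accession(accession):
    """Return (cik, year, sequence) as ints, or None if the layout does not match."""
    match = ACCESSION_RE.match(accession.strip())
    if match is None:
        return None
    cik, year, sequence = (int(part) for part in match.groups())
    return cik, year, sequence


print(parse_accession("0001193125-15-118890"))  # (1193125, 15, 118890)
```
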

Check whether the first part of the accession number matches the CIK. If it does not, set the flag 'Y' in the anomaly detected_accession number column.
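
One possible way to compute that flag is sketched below. The function name `accession_cik_flag`, the use of 'N' for a match, and the decision to flag missing or malformed values as 'Y' are assumptions, not something stated in this thread. The CIK is compared as an integer so that values such as `1193125.0` or `"1193125"` still match the zero-padded prefix.

```python
def accession_cik_flag(cik, accession):
    """Return 'Y' if the accession number's first part differs from the CIK, else 'N'."""
    try:
        prefix = int(accession.split("-")[0])
        return "N" if prefix == int(float(cik)) else "Y"
    except (AttributeError, TypeError, ValueError, OverflowError):
        # Assumption: missing or malformed values cannot be confirmed
        # as a match, so they are flagged as well.
        return "Y"
```
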

vasantivmahajan commented 7 years ago

The value in the extension field has a minus sign "-" prepended to it. Is this expected?