prasadpn / Ticket-Assignment

Model for automatic ticket assignment

Data cleaning steps #6

Closed prasadpn closed 2 years ago

prasadpn commented 2 years ago

As an ML engineer, I would like to identify the steps needed to clean the data.

prasadpn commented 2 years ago

@asjad248 @mdkamal07

I was going through the input file....... randomly checking things. I could see text like this as well: "to �贺,早上电脑开机开�出�"

need to search how to clean them up

prasadpn commented 2 years ago

@asjad248 @mdkamal07

I could also see usage of non-English text, e.g.

(1) prompted at step: "auftragsausgang > von revision freigegeben" (German, roughly "order dispatch > released by revision")

not sure if we have statements in other languages as well - we may have to see how to handle them

prasadpn commented 2 years ago

@asjad248 @mdkamal07

this is an imbalanced dataset - there are some groups which have only 1 statement... it will be challenging....

we may have to research how to handle such scenarios
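To see how severe the imbalance is before picking a remedy, a quick count of tickets per group could look like the sketch below (the group labels are made up for illustration; the real ones come from the dataset):

```python
from collections import Counter

# Illustrative assignment-group labels standing in for the real column.
groups = ["GRP_0", "GRP_0", "GRP_0", "GRP_12", "GRP_47"]

counts = Counter(groups)
# Groups with a single ticket are the hardest cases to learn from.
singletons = sorted(g for g, c in counts.items() if c == 1)
```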

prasadpn commented 2 years ago

@asjad248 @mdkamal07

I would suggest not removing the stop words for the model, as at times they help in providing context.

prasadpn commented 2 years ago

@asjad248 @mdkamal07

there are escape characters in some statements as well, such as "\r\n\r\nr" -- we will have to handle them too
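One minimal way to collapse those escape sequences (and any other runs of whitespace) into single spaces - a sketch, not necessarily the project's final cleaning step:

```python
import re

def normalize_whitespace(text: str) -> str:
    # \s matches \r, \n, and \t, so runs of escape characters
    # collapse into a single space.
    return re.sub(r"\s+", " ", text).strip()

cleaned = normalize_whitespace("login issue\r\n\r\nplease help")
```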

prasadpn commented 2 years ago

@asjad248 @mdkamal07

for stop words ---- maybe we can remove some custom stop words....

the description column is the mail body - I see "hi", "hello", "good morning"... "regards" -- such words can be removed
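That suggestion could be sketched as below: ordinary stop words stay, and only greeting/sign-off tokens are dropped. The word list is illustrative; a multi-word greeting like "good morning" would need phrase matching rather than a per-token check:

```python
# Illustrative greeting/sign-off tokens to strip from mail bodies.
CUSTOM_STOPWORDS = {"hi", "hello", "regards", "thanks"}

def drop_custom_stopwords(text: str) -> str:
    # Ordinary stop words are kept on purpose - they can carry context.
    return " ".join(w for w in text.split() if w.lower() not in CUSTOM_STOPWORDS)

cleaned = drop_custom_stopwords("hello team the printer is not working regards john")
```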

asjad248 commented 2 years ago

For the imbalanced dataset with just 1 example - we can check unique token frequency (unigrams, bigrams, trigrams). If we find similar tokens we can group those categories together. We can also check cosine similarity and decide to group those categories together. - Just a thought
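A rough sketch of the cosine-similarity half of that idea, using scikit-learn TF-IDF vectors. The category texts and the 0.2 threshold are made up for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative: one concatenated document per rare category.
docs = {
    "GRP_12": "printer spooler error on floor 3",
    "GRP_47": "printer not printing spooler service stopped",
    "GRP_30": "vpn token expired cannot connect",
}
names = list(docs)
X = TfidfVectorizer().fit_transform(docs[n] for n in names)
sim = cosine_similarity(X)

# Candidate merges: category pairs above an arbitrary similarity threshold.
pairs = [(names[i], names[j])
         for i in range(len(names))
         for j in range(i + 1, len(names))
         if sim[i, j] > 0.2]
```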

prasadpn commented 2 years ago

@asjad248 - we can explore that approach for the imbalanced dataset

prasadpn commented 2 years ago

@asjad248 @mdkamal07

I was doing some searching on the "©ä¸Šç”µè„‘开机开ä¸�出æ�¥" text appearing in our dataset. What is it?

You can refer to the URL below for details -- it is called "mojibake" -- at least now we know what it is called -- now I will look into how to handle it in Python

https://www.kaggle.com/rtatman/data-cleaning-challenge-character-encodings


prasadpn commented 2 years ago

@asjad248 @mdkamal07 @LakshmiDadarkar

To handle mojibake:

    pip install ftfy

    import ftfy
    cleaned_mojibake = ftfy.fix_text(text)
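For a sense of what ftfy is undoing: mojibake appears when UTF-8 bytes are decoded with the wrong codec, and the simple cases can be reproduced (and reversed) with the standard library alone:

```python
original = "早上电脑开机"  # a Chinese fragment like the one in our data

# Decoding UTF-8 bytes as Latin-1 produces mojibake of the kind we saw.
garbled = original.encode("utf-8").decode("latin-1")

# Reversing the mis-decode recovers the text; ftfy automates guessing
# which codec pair caused the damage.
repaired = garbled.encode("latin-1").decode("utf-8")
```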

prasadpn commented 2 years ago

@asjad248 @mdkamal07 @LakshmiDadarkar

For translation, one option is:

    pip install deep-translator

    from deep_translator import GoogleTranslator
    translated_text = GoogleTranslator(source='auto', target='english').translate(text)

prasadpn commented 2 years ago

@asjad248 @LakshmiDadarkar @mdkamal07

On further investigating the data I observed that a lot of Long Description column values follow the pattern

"received from: eylqgodm.ybqkwiam@gmail.com", where the address portion can be any mail id. Further investigation showed there are 2251 such records out of the 8500 we have.

So in the data cleaning part I have added a step to extract such email ids (the ones following "received from:") and put them in another column


I have also removed the pattern from the text we use for analysis, as I think it is not of much value in text analysis. We can leverage the new column to identify the rows which had this pattern
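A sketch of that extraction step with pandas; the column names here are placeholders, not necessarily the real ones:

```python
import pandas as pd

# Placeholder data mimicking the "received from:" pattern.
df = pd.DataFrame({"long_description": [
    "received from: eylqgodm.ybqkwiam@gmail.com\r\nunable to log in",
    "password reset request",
]})

pattern = r"received from:\s*(\S+@\S+)"

# Pull the address into its own column (NaN where the pattern is absent)...
df["received_from"] = df["long_description"].str.extract(pattern, expand=False)

# ...then drop the matched span from the text used for analysis.
df["desc_analysis"] = (df["long_description"]
                       .str.replace(pattern, "", regex=True)
                       .str.strip())
```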


We can also see that there are 961 records where the issue was initially triggered by a system-generated mail from monitoring_tool@company.com


asjad248 commented 2 years ago

Good finding, Pankaj!

prasadpn commented 2 years ago

@asjad248 @LakshmiDadarkar @mdkamal07

I also see a lot of data in this pattern - copies of mails:

from: to: sent: subject:

dear sir/madam

how to handle such things.... I was thinking to remove such occurrences before proceeding with analysis, as from:, to: and date won't help further analysis. The subject might help, so I am thinking to extract it into another column

do suggest


prasadpn commented 2 years ago

@asjad248 @LakshmiDadarkar @mdkamal07

from the description used for analysis, I have removed:

- from: and the text following it till the line break
- to: and the text following it till the line break
- sent: and the text following it till the line break
- date: and the text following it till the line break
- subject: and the text following it till the line break
- cc: and the text following it till the line break

I have also copied the subject to a new column
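The removal plus subject extraction could be sketched as below; the field list and helper name are mine, and the real code may differ:

```python
import re

HEADER_FIELDS = ("from", "to", "sent", "date", "subject", "cc")
HEADER_RE = re.compile(r"(?:%s):.*?(?:\r?\n|$)" % "|".join(HEADER_FIELDS),
                       flags=re.IGNORECASE)
SUBJECT_RE = re.compile(r"subject:\s*(.*?)(?:\r?\n|$)", flags=re.IGNORECASE)

def strip_mail_headers(text):
    """Return (clean_text, subject): header lines removed, subject kept aside."""
    m = SUBJECT_RE.search(text)
    subject = m.group(1).strip() if m else None
    # Each "field: ... <line break>" span is dropped, however often it repeats.
    return HEADER_RE.sub("", text).strip(), subject

clean, subject = strip_mail_headers(
    "from: a@x.com\r\nto: b@x.com\r\nsent: monday 9am\r\n"
    "subject: printer down\r\ndear sir/madam\r\nthe printer is not working"
)
```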


prasadpn commented 2 years ago

@asjad248 @LakshmiDadarkar @mdkamal07

I have improved the code to remove multiple occurrences of from:/to:/cc:/sent:

input text with multiple from: occurrences: [screenshot]

clean text: [screenshot]

prasadpn commented 2 years ago

@asjad248 @LakshmiDadarkar @mdkamal07

this seems to be a template which is used to collect data - it can be cleaned further..

I guess it is the summary that will be of use


prasadpn commented 2 years ago

@asjad248 @LakshmiDadarkar @mdkamal07

I have added my latest code to GitHub. I have added code to remove the embedded image references and certain other template questions.

please have a look at the output dataframe -- I am writing it to Excel. Request you all to check the cleaned desc_analysis column against Description
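For the embedded image references, a sketch of the kind of pattern that could be removed; the exact placeholder formats here (cid: tags and numbered image files) are assumptions about the data, not confirmed formats:

```python
import re

# Assumed formats for inline-image leftovers from HTML mail bodies.
IMG_REF = re.compile(r"\[cid:[^\]]+\]|\bimage\d+\.(?:png|jpe?g|gif)\b",
                     flags=re.IGNORECASE)

def drop_image_refs(text):
    # Remove the references, then collapse the whitespace left behind.
    return re.sub(r"\s+", " ", IMG_REF.sub("", text)).strip()

cleaned = drop_image_refs("error shown [cid:image001.png@01D7] see image002.jpg here")
```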

asjad248 commented 2 years ago

Thanks Pankaj.

LakshmiDadarkar commented 2 years ago

Thanks Pankaj!
