Closed prasadpn closed 2 years ago
@asjad248 @mdkamal07
I was going through the input file, randomly spot-checking records, and could see garbled text like this: "to �贺,早上电脑开机开�出�"
We need to research how to clean these up.
@asjad248 @mdkamal07
I could also see usage of non-English text, e.g.:
(1) prompted at step: "auftragsausgang > von revision freigegeben" (German; roughly "order outbound > released by revision")
Not sure if we have statements in other languages as well; we may have to see how to handle them.
@asjad248 @mdkamal07
This is an imbalanced dataset: some groups have only one statement, which will make classification challenging.
We may have to research how to handle such scenarios.
@asjad248 @mdkamal07
I would suggest not removing the stop words for the model, as at times they help provide context.
@asjad248 @mdkamal07
There are escape characters in the statements as well, such as \r\n\r\nr -- we will have to handle them too.
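A minimal sketch of that cleanup, assuming we want to collapse both real control characters and literal "\r"/"\n" strings that survived export (the function name is my own):

```python
import re

def clean_escapes(text):
    """Replace runs of carriage returns / newlines -- both real control
    characters and the literal two-character sequences \\r and \\n that
    sometimes survive export -- with a single space, then collapse any
    repeated whitespace left behind."""
    text = re.sub(r"(\\r|\\n|[\r\n])+", " ", text)
    return re.sub(r"\s+", " ", text).strip()
```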
@asjad248 @mdkamal07
For stop words, maybe we can remove a custom list instead.
The description column is a mail body; I see "Hi", "hello", "good morning", ... "regards" -- such words can be removed.
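That greeting/sign-off removal could look something like this; the phrase list below is just illustrative and should be built from what we actually see in the mail bodies:

```python
import re

# Illustrative greeting/sign-off phrases observed in mail bodies;
# extend this after inspecting the actual description column.
CUSTOM_STOPWORDS = {"hi", "hello", "good morning", "good afternoon",
                    "regards", "thanks", "thank you"}

def remove_custom_stopwords(text):
    """Lowercase the text and strip the custom phrases, longest first
    so that e.g. 'good morning' is removed before 'good' could match."""
    result = text.lower()
    for phrase in sorted(CUSTOM_STOPWORDS, key=len, reverse=True):
        result = re.sub(r"\b" + re.escape(phrase) + r"\b", " ", result)
    return re.sub(r"\s+", " ", result).strip()
```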
For categories in the imbalanced dataset with just one example, we can check unique token frequency (unigrams, bigrams, trigrams). If we find similar tokens, we can group those categories together. We can also check cosine similarity and decide to merge categories on that basis. Just a thought.
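A rough sketch of that idea using TF-IDF over unigrams and bigrams plus cosine similarity; the group names and the merge threshold here are made up for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical per-category texts: all statements of a category
# concatenated into one string (category names are illustrative).
group_texts = {
    "printer_issue": "printer not printing paper jam toner",
    "print_failure": "printing fails printer offline toner low",
    "vpn_access":    "cannot connect vpn token expired",
}

labels = list(group_texts)
# unigrams + bigrams, as suggested above
vec = TfidfVectorizer(ngram_range=(1, 2))
X = vec.fit_transform([group_texts[g] for g in labels])
sim = cosine_similarity(X)

# propose merging any pair of distinct categories above a threshold
THRESHOLD = 0.3  # tune on the real data
merge_candidates = [
    (labels[i], labels[j])
    for i in range(len(labels))
    for j in range(i + 1, len(labels))
    if sim[i, j] >= THRESHOLD
]
```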
@asjad248 we can explore that approach for the imbalanced dataset.
@asjad248 @mdkamal07
I was doing some searching on the "©ä¸Šç”µè„‘开机开ä¸�出æ�¥" strings appearing in our dataset. What are they?
You can refer to the URL below for details. It is called "mojibake"; at least now we know what it is called. Now I will look into how to handle it in Python.
https://www.kaggle.com/rtatman/data-cleaning-challenge-character-encodings
@asjad248 @mdkamal07 @LakshmiDadarkar
To handle mojibake:
pip install ftfy

import ftfy
cleaned_mojibake = ftfy.fix_text(text)
@asjad248 @mdkamal07 @LakshmiDadarkar
For translation, one option is:
pip install deep-translator

from deep_translator import GoogleTranslator
translated_text = GoogleTranslator(source='auto', target='english').translate(text)
@asjad248 @LakshmiDadarkar @mdkamal07
On further investigation of the data, I observed that a lot of Long Description values contain the pattern
"received from: eylqgodm.ybqkwiam@gmail.com", where the address can be any mail id. Further investigation showed there are 2251 such records out of the 8500 we have.
So in the data cleaning part I have added a step to retrieve the email ids that follow "received from:" and put them in another column.
I have also removed the pattern from the text we use for analysis, as I don't think it adds much value there. We can leverage the new column to identify the rows that had the pattern.
We can also see that there are 961 records where the issue was initially triggered by a system-generated mail from monitoring_tool@company.com.
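The extraction step could look roughly like this (the column names "Long Description", "received_from_email", and "desc_analysis" follow the discussion above, but verify them against the actual dataframe):

```python
import re
import pandas as pd

# Pattern observed in the data: "received from: someone@example.com"
RECEIVED_RE = re.compile(
    r"received from:\s*([\w.+-]+@[\w-]+\.[\w.-]+)", re.IGNORECASE)

def extract_received_from(df, col="Long Description"):
    """Pull the 'received from:' address into its own column and strip
    the whole pattern from the text kept for analysis."""
    df = df.copy()
    df["received_from_email"] = df[col].str.extract(RECEIVED_RE, expand=False)
    df["desc_analysis"] = (df[col]
                           .str.replace(RECEIVED_RE, "", regex=True)
                           .str.strip())
    return df
```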
Good finding, Pankaj!
@asjad248 @LakshmiDadarkar @mdkamal07
I also see a lot of data in this pattern, which are copies of mails:
from: to: sent: subject:
dear sir/madam
How should we handle these? I was thinking of removing such occurrences before proceeding with the analysis, as from, to, and date won't help further analysis. The subject might help, so I am thinking of extracting it into another column.
Do suggest.
@asjad248 @LakshmiDadarkar @mdkamal07
From the description used for analysis, I have removed from:, to:, sent:, date:, subject:, and cc:, each together with the text following it up to the line break.
I have also copied the subject to a new column.
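A sketch of that header stripping, assuming each header sits on its own line (the real mail copies may need more patterns):

```python
import re

# header lines to drop entirely
HEADER_RE = re.compile(r"^(from|to|sent|date|cc):.*$",
                       re.IGNORECASE | re.MULTILINE)
# subject line: captured before removal so it can go to a new column
SUBJECT_RE = re.compile(r"^subject:\s*(.*)$",
                        re.IGNORECASE | re.MULTILINE)

def split_headers(text):
    """Return (subject, body): the subject text, if any, and the body
    with all the header lines above removed."""
    m = SUBJECT_RE.search(text)
    subject = m.group(1).strip() if m else ""
    body = SUBJECT_RE.sub("", HEADER_RE.sub("", text))
    # collapse the blank lines left behind by the removals
    body = re.sub(r"\n{2,}", "\n", body).strip()
    return subject, body
```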
@asjad248 @LakshmiDadarkar @mdkamal07
I have improved upon the code to remove multiple occurrences of from:/to:/cc:/sent:.
(screenshots: input text with multiple from: occurrences, and the cleaned text)
@asjad248 @LakshmiDadarkar @mdkamal07
This seems to be a template used to collect data; it can be cleaned further.
I guess it is the summary that will be of use.
@asjad248 @LakshmiDadarkar @mdkamal07
I have added my latest code to the GitHub repo, including code to remove the embedded image references and certain other template questions.
Please have a look at the output dataframe; I am writing it to Excel. Request you all to compare the cleaned desc_analysis column against Description.
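For the embedded image references, one common pattern in Outlook-style mail bodies is "[cid:...]"; a sketch of removing it (the pattern is an assumption, check it against our data):

```python
import re

# Assumed shape of embedded-image references in mail bodies,
# e.g. "[cid:image001.png@...]"; verify against the dataset.
CID_RE = re.compile(r"\[cid:[^\]]+\]", re.IGNORECASE)

def remove_image_refs(text):
    """Drop [cid:...] references and tidy the whitespace left behind."""
    return re.sub(r"\s+", " ", CID_RE.sub(" ", text)).strip()
```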
Thanks Pankaj.
Thanks Pankaj!
As an ML engineer, I would like to identify the steps needed to clean the data.