rohitmanglik / scrap_careers


Reformatting #8

Open rohitmanglik opened 7 years ago

rohitmanglik commented 7 years ago

The current way of representing data using '<', '>' and '#' is no longer required. We have migrated to more powerful technologies to meet our goals.

Here is the new format:

Facilities, Images and Videos: Comma separated values

Courses, Top following states, Admission mode, Geometric Insights: Split them into more columns, since we will have to do that at a later stage anyway. If splitting is not possible (because the number of elements is not fixed), serialize the values.

Split Average Age into two columns

gupta-shantanu commented 7 years ago

@rohitmanglik How should fields such as Notable Alumni, Admission Mode and Courses be formatted? Each of them contains multiple entries with the same sub-fields. In the database they would probably have tables of their own. Currently I have serialized them as a list of tuples, e.g. the field Admission Mode (name, type, level) for some college can be (JEE Advanced, U.G, National Level Exam/Admission), (GATE, P.G, National Level Exam/Admission), (CEED, P.G, National Level Exam/Admission)

rohitmanglik commented 7 years ago

Serialize the data using http://www.php2python.com/wiki/function.serialize/

We are not going for Pickle or JSON because our consuming API is written in PHP, and we would have to convert the data into a format it can easily parse anyway.

rohitmanglik commented 7 years ago

e.g. < Exam : SRMJEEE # Type : U.G # Level : University Level Exam/Admission >< Exam : NATA # Type : U.G # Level : National Level Exam/Admission >< Exam : SRMGEET # Type : P.G # Level : University Level Exam/Admission >< Exam : GATE # Type : P.G # Level : National Level Exam/Admission >< Exam : TANCET # Type : P.G # Level : State Level Exam/Admission >

You can easily convert it into the following format

{ 0: { 'exam': 'SRMJEEE', 'type': 'U.G', 'level': 'University Level Exam/Admission' }, 1: { ... }, and so on }

and then serialize it.
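A minimal sketch of the two steps above, assuming the legacy cell never contains '<', '>' or '#' inside a value and that each pair has the form "Key : Value". The function names parse_legacy and php_serialize are hypothetical, and php_serialize is a hand-rolled subset of PHP's serialize() format (integers, strings and arrays only), not the helper from the linked page.

```python
import re


def parse_legacy(cell):
    """Turn '< K : V # K : V >< ... >' into {0: {...}, 1: {...}}."""
    records = {}
    for index, body in enumerate(re.findall(r"<([^<>]*)>", cell)):
        record = {}
        for pair in body.split("#"):
            # Split on the first colon only; keys are lower-cased.
            key, _, value = pair.partition(":")
            record[key.strip().lower()] = value.strip()
        records[index] = record
    return records


def php_serialize(value):
    """Serialize ints, strings and dicts in PHP serialize() notation."""
    if isinstance(value, bool):
        raise TypeError("booleans are not handled in this sketch")
    if isinstance(value, int):
        return "i:%d;" % value
    if isinstance(value, str):
        # PHP counts bytes, not characters, in the length prefix.
        return 's:%d:"%s";' % (len(value.encode("utf-8")), value)
    if isinstance(value, dict):
        body = "".join(
            php_serialize(k) + php_serialize(v) for k, v in value.items()
        )
        return "a:%d:{%s}" % (len(value), body)
    raise TypeError("unsupported type: %r" % type(value))


if __name__ == "__main__":
    legacy = (
        "< Exam : SRMJEEE # Type : U.G # Level : University Level Exam/Admission >"
        "< Exam : NATA # Type : U.G # Level : National Level Exam/Admission >"
    )
    print(php_serialize(parse_legacy(legacy)))
```

On the PHP side the resulting string should be readable with unserialize(), which yields a nested array indexed 0, 1, and so on.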

rohitmanglik commented 7 years ago

The last commit had errors (the Excel file was empty), so I am reopening this issue.

gupta-shantanu commented 7 years ago

It seems the courses column exceeds Excel's maximum character limit. We can split each course entry into its own column (but then each row would have a different number of columns). Alternatively, we can write the course details to a new file, with an extra column identifying which college each entry belongs to. @alchem9st any suggestions?

rohitmanglik commented 7 years ago

@alchem9st when you last ran the scraper, what was the maximum number of courses? If it's within 50, we can go for splitting it into multiple files.

Also, for the worst-case scenario (i.e. the exceptional cases where the content is still too long after splitting the course column into multiple columns), we need to truncate the content to the maximum number of characters allowed in a cell. We have to be fault tolerant: after running this program for weeks in the background, we should not end up with an empty result.

@gupta-shantanu a new file is not an option; it's a limitation of our underlying import library.
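The worst-case fallback described above could look roughly like this. It is a sketch under stated assumptions: 32,767 is Excel's documented limit on characters per cell, the default of 50 columns comes from the threshold mentioned in this comment, and dropping courses beyond that count is an assumption the thread does not settle. The names fit_cell and courses_to_columns are hypothetical.

```python
EXCEL_MAX_CELL_CHARS = 32767  # Excel's limit on total characters in one cell


def fit_cell(text, limit=EXCEL_MAX_CELL_CHARS):
    """Truncate text so it never exceeds the per-cell character limit."""
    return text if len(text) <= limit else text[:limit]


def courses_to_columns(courses, max_columns=50, limit=EXCEL_MAX_CELL_CHARS):
    """Spread courses over a fixed number of columns, one course per cell.

    Assumption: courses beyond max_columns are dropped rather than failing
    the whole row, so a long run still produces output.
    """
    cells = [fit_cell(course, limit) for course in courses[:max_columns]]
    # Pad with empty strings so every row has the same width.
    cells += [""] * (max_columns - len(cells))
    return cells


if __name__ == "__main__":
    row = courses_to_columns(["B.Tech CSE", "M.Tech VLSI"], max_columns=4)
    print(row)
```

Padding every row to the same width avoids the "different number of columns per row" problem raised earlier in the thread.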

rohitmanglik commented 7 years ago

I checked the latest file (it has two records). Commas are currently being converted to dollar signs; please revert this.

BssMsi commented 7 years ago

Can you please help me with this error? Thanks

'''
Traceback (most recent call last):
  File "D:/Programming/Python/scrap_careers-master/scrap_careers.py", line 23, in <module>
    display = Display(visible=0, size=(1024, 768))
  File "C:\Anaconda\lib\site-packages\pyvirtualdisplay\display.py", line 34, in __init__
    self._obj = self.display_class(
  File "C:\Anaconda\lib\site-packages\pyvirtualdisplay\display.py", line 52, in display_class
    cls.check_installed()
  File "C:\Anaconda\lib\site-packages\pyvirtualdisplay\xvfb.py", line 38, in check_installed
    ubuntu_package=PACKAGE).check_installed()
  File "C:\Anaconda\lib\site-packages\easyprocess\__init__.py", line 180, in check_installed
    raise EasyProcessCheckInstalledError(self)
easyprocess.EasyProcessCheckInstalledError: cmd=['Xvfb', '-help'] OSError=[Error 2] The system cannot find the file specified Program install error!

Process finished with exit code 1
'''

rohitmanglik commented 7 years ago

@gupta-shantanu would you like to answer?

Also please update documentation for this issue.

pinakinathc commented 7 years ago

@BssMsi I was trying to execute line 6, "from selenium import selenium", but it did not work. I may have missed installing some packages, though I have installed "selenium". Please help.

dimritium commented 7 years ago

@BssMsi please install xvfb. On Linux the command is sudo apt-get install xvfb; please see for windows
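Since the traceback above comes from pyvirtualdisplay failing to find Xvfb on Windows, one possible workaround is to start the virtual display only on Linux. This is a hedged sketch, not the project's actual code: the helper names needs_virtual_display and start_virtual_display are hypothetical, and the Display arguments are copied from the traceback.

```python
import sys


def needs_virtual_display(platform=sys.platform):
    """Xvfb is an X11 server, so only attempt it on Linux."""
    return platform.startswith("linux")


def start_virtual_display():
    """Start an Xvfb-backed display on Linux; return None elsewhere."""
    if not needs_virtual_display():
        return None
    # Imported lazily so the script still loads where the package
    # or Xvfb itself is not installed.
    from pyvirtualdisplay import Display

    display = Display(visible=0, size=(1024, 768))
    display.start()
    return display
```

A caller would keep the returned object and call its stop() method when scraping finishes, skipping that step when None is returned.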