Closed · tayfunyasar closed this issue 7 years ago
I have code that downloads only the usernames. The current task is heavy because it makes a full-info request for every one of the ~80k users.
These are my code samples; you can integrate them into your notebook.
I wrote two helper methods for your task:
```python
import pandas as pd
from tqdm import tqdm

def dump_data(users, output_filename):
    try:
        data = pd.DataFrame(users).drop_duplicates()
    except Exception:
        # A closed/private account yields no usable data -> save an empty file.
        data = pd.DataFrame(columns=["username", "user_id"])
    mode = "w"
    # Write the header only when (re)creating the file, not when appending.
    data.to_csv(output_filename, mode=mode, sep="\t", header=(mode == "w"), index=False)

def save_followers_from_user(iterator, output_filename):
    # Pull the username back out of the output path; this assumes a fixed
    # 10-character prefix before the name (e.g. "followers_<name>.csv")
    # and a ".csv" suffix.
    get_username = lambda x: x[x.rfind("/") + 11:-4]
    users = []
    for _user in tqdm(iterator, desc=get_username(output_filename), leave=False):
        users.append({"username": _user["username"], "user_id": _user["pk"]})
    dump_data(users, output_filename)
```
And the usage looks like this:

```python
iterator = get.user_followers(user_id, total=None)
save_followers_from_user(iterator, output_filename)
```
`user_id` is the ID of the user whose followers you want to scrape.
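If you later need just the usernames out of the dump, pandas can pull that column straight back out of the tab-separated file (the filenames below are placeholders):

```python
import pandas as pd

# Read the tab-separated dump and keep only the username column.
usernames = pd.read_csv("followers_example.csv", sep="\t")["username"]
usernames.to_csv("usernames_only.csv", index=False, header=True)
```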
I'm trying to get a user's followers (about 80k). It hasn't finished yet; it has been running for 4 days and counting.
How can we optimise this? Maybe save after every 100 requests and skip the usernames that are already saved?
As a workaround, how can I export only the usernames into a CSV?
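On the "save every 100 requests" idea: since `save_followers_from_user` above only writes once at the very end, a crash after 4 days loses everything. Below is a minimal, untested sketch (it assumes the same iterator and the same tab-separated layout as above; `save_followers_incrementally`, `_flush`, and `chunk_size` are my own names, not part of the library) that appends a chunk to the CSV every 100 users and skips `user_id`s already on disk, so an interrupted run can be restarted without duplicating rows:

```python
import os

import pandas as pd
from tqdm import tqdm

def _flush(chunk, output_filename):
    # Append a chunk; write the header only if the file does not exist yet.
    header = not os.path.exists(output_filename)
    pd.DataFrame(chunk).to_csv(output_filename, mode="a", sep="\t",
                               header=header, index=False)

def save_followers_incrementally(iterator, output_filename, chunk_size=100):
    seen = set()
    if os.path.exists(output_filename):
        # Resume: remember which user_ids a previous run already saved.
        seen = set(pd.read_csv(output_filename, sep="\t")["user_id"].astype(str))
    chunk = []
    for _user in tqdm(iterator, leave=False):
        if str(_user["pk"]) in seen:
            continue  # already on disk from a previous run
        chunk.append({"username": _user["username"], "user_id": _user["pk"]})
        if len(chunk) >= chunk_size:
            _flush(chunk, output_filename)
            chunk = []
    if chunk:
        _flush(chunk, output_filename)
```

One caveat: skipping already-saved IDs only avoids duplicate rows; the iterator still walks the same paginated requests from the beginning, so truly resuming the scraping itself would need the library to accept a pagination cursor.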