minimaxir / download-tweets-ai-text-gen

Python script to download public Tweets from a given Twitter account into a format suitable for AI text generation.
MIT License
219 stars 41 forks

Updated to support loading tweets from multiple accounts #14

Closed sdelgadoc closed 4 years ago

sdelgadoc commented 4 years ago

Thanks for all your work to make GPT-2 easier to work with!

The tweet downloading script has been great, but didn't give me enough data to build a robust 100MB+ model.

I updated the script to download tweets from multiple accounts using a text file.

The script works the same as before unless you pass a .txt file as the username parameter. In that case, it downloads tweets from every account listed in the file instead of from a single account.
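As a rough illustration of that behavior, here is a minimal sketch of how a username argument could be turned into a list of accounts. The function name `resolve_usernames` and the one-account-per-line file layout are assumptions made for this example, not necessarily what the pull request implements.

```python
from pathlib import Path


def resolve_usernames(username_arg):
    """Return the list of accounts named by a username argument.

    Assumption: a .txt argument is a file with one account name per line.
    Anything else is treated as a single account name.
    """
    if username_arg.lower().endswith(".txt"):
        lines = Path(username_arg).read_text(encoding="utf-8").splitlines()
        # Skip blank lines and drop a leading "@" if one is present.
        return [line.strip().lstrip("@") for line in lines if line.strip()]
    return [username_arg.lstrip("@")]
```

With a helper like this, the download loop can iterate over the returned list whether one account or many were requested.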

With this, one can build a model for a 'type' of tweeter made up of multiple accounts. As an example, I am including files of the Republican and Democratic leadership accounts.

If you want to merge into master, I'm happy to send another pull request updating the README so folks know how the functionality works.

DeFiDude commented 4 years ago

When using multiple usernames, it downloads the first account fine, then waits until it times out on the second username (regardless of which names are used) and exits. Any idea why?

sdelgadoc commented 4 years ago

Thanks for your feedback, DeFiDude. I ran the file and found a couple of bugs, which I fixed in the latest update. I wasn't able to reproduce your issue, though.

I have tested the script with the command and file below, and it worked without issues. Can you try it and let me know if you can reproduce the issue?

download_tweets.py small_test.txt

small_test.txt

DeFiDude commented 4 years ago

It's working just fine now, thanks!

I tipped your Github with some BAT.

sdelgadoc commented 4 years ago

Glad to hear it worked for you! Let us know what you learn playing around with it.

And thanks for the BAT. People tip you for finding your bugs; what a time to be alive! :-)

DeFiDude commented 4 years ago

Is there a way to message you on Github, or do you have a contact e-mail? I am trying to message you in regards to this, but I don't want to clutter a GPT-2 Github timeline (I'm a Github noob).

Thanks!

sdelgadoc commented 4 years ago

Happy to discuss here.

DeFiDude commented 4 years ago

Great. It's not specifically to do with the Python tweet grabber, but with a problem I'm having afterwards: after I generate the CSV and plug it into gpt-2-simple, I get an "index out of range" error.

image

It seems to snag after trying to restore the 355M model parameters?

Appreciate any help.

sdelgadoc commented 4 years ago

I'll start by saying that I'm not a gpt-2-simple expert, but I can point you in the right direction.

I was not able to reproduce your error, so let me walk you through what I did, and you can see how it compares to your code.

This article was what I used to learn about gpt-2-simple. In the article, the author references a Google Colab Notebook with out-of-the-box code to run gpt-2-simple using the output of this script.

Using the tweet file generated from small_test.txt above, I ran the Notebook and was able to train the model and generate text without any issues.

I would recommend you take a look at the Notebook and compare to your code to find differences.

DeFiDude commented 4 years ago

That’s what’s strange: I’m using the Colab notebook as well, and I still run into the issue. I’ve started from scratch almost a dozen times, and it seems a small handful of people are getting the same error, so something must be being overlooked or missed, but I’m following the directions exactly as stated. I’ll try it again tonight with small_test.txt, note my steps, and see if there’s an error I’m making.

Thanks for the reply.

sdelgadoc commented 4 years ago

The last time I got that error, it was due to one of two things: 1) the file was incomplete, so the CSV reader choked, or 2) the file was not a CSV, so the CSV reader choked. I think your issue is most likely 1), so try the attached tweet file and see if that fixes the issue for you. small_test.txt_tweets.zip
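One way to check a tweet file for this kind of problem before training is to scan it with Python's csv module and report rows that are empty or have an unexpected number of columns. This is only a sketch: the function name `find_bad_rows` is hypothetical, and the default of one column per row is an assumption about the file layout.

```python
import csv


def find_bad_rows(csv_path, expected_columns=1):
    """Return (line_number, row) pairs for rows with the wrong column count.

    Blank lines are reported too, because csv.reader yields them as empty rows.
    Assumption: every valid row has exactly `expected_columns` fields.
    """
    bad = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        for row in reader:
            if len(row) != expected_columns:
                # line_num is the number of source lines read so far.
                bad.append((reader.line_num, row))
    return bad
```

An empty result does not prove the file is good, but a non-empty one points at the lines worth inspecting first.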

DeFiDude commented 4 years ago

Huh, that's quite weird - I went through it again only using small_test for the CSV, and it worked just fine.

This of course points to something being wrong with my CSV, but I don't know what it could possibly be.

At first glance, I noticed small_test does not have blank lines in between each tweet, however all of my personal tests that have failed seem to have blank lines in between each tweet.

I'm not sure if this is the reason, will test it now but I can't imagine it'd be the problem considering it's the default CSV that was pushed out using your/maxi's tweet to CSV script.

So by that, my only guess is that one of the Twitter accounts I'm grabbing from has some sort of tweet that somehow breaks the CSV/CSV Reader? Though I have no idea how I would be able to figure out which one considering it's 40+ accounts and 80,000+ tweets. It'd be a long day testing each account individually.

DeFiDude commented 4 years ago

I removed all blank lines in the CSV, and it seems to be training now on my CSV that previously threw errors.

I think we are in the clear now!
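For anyone who would rather not remove the blank lines by hand, the cleanup can be sketched with the csv module. Going through csv.reader instead of filtering raw text lines keeps line breaks inside quoted tweets intact. The function name `drop_blank_rows` and the file paths are hypothetical.

```python
import csv


def drop_blank_rows(src_path, dst_path):
    """Copy a CSV to dst_path, skipping rows that are empty or whitespace-only.

    Returns the number of rows written.
    """
    kept = 0
    with open(src_path, newline="", encoding="utf-8") as src, \
            open(dst_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            # A blank line parses as [], so any() is False and the row is skipped.
            if any(cell.strip() for cell in row):
                writer.writerow(row)
                kept += 1
    return kept
```

Writing to a separate output file leaves the original download untouched in case the cleaned version needs to be regenerated.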

DeFiDude commented 4 years ago

Is it normal for \n's to show up all over the place in the exported tweets? Example:

I know that it's replacing the line breaks, but when teaching the AI - it seems to also include \n's in some of the outputs.

sdelgadoc commented 4 years ago

The model will be trained based on the data you give it. If your data has a lot of \n's, the output will have a lot of \n's.

The script does not escape \n's, so it includes the \n's found in the tweets.

So you will see \n's in the output, and if you want to turn them into actual line breaks, you will have to convert them yourself.
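If the \n's appear in the generated text as the literal two characters backslash and n (an assumption about what the output contains), the conversion is a one-line string replacement. The function name `restore_line_breaks` is made up for this sketch.

```python
def restore_line_breaks(text):
    """Replace each literal backslash-n sequence with a real newline character."""
    return text.replace("\\n", "\n")


# Example: a generated tweet containing a literal "\n" marker.
generated = "First line\\nSecond line"
print(restore_line_breaks(generated))
```

Text that already contains real newline characters passes through unchanged.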

DeFiDude commented 4 years ago

Thank you! And one final question - is there any reference that has suggestions for how many steps to use based on my dataset?

I've got ~8MB in data (3.2M tokens), though Minimaxir warns to use 200-500 steps if you don't have a lot of data. 8MB is more than the average Twitter account (100k+ tweets), so I'm not sure if that means 2,000 steps is acceptable, or higher/lower.

Thanks for all the help again. I've tipped you some BAT again.

sdelgadoc commented 4 years ago

My live models have 20k - 30k steps, so 2,000 seems low to me. I have not seen any good data on this; I'm actually working on some surveys to help tease this out. I'll ask for your help in collecting data once I'm done.

If you have any BAT burning a hole in your pocket, I'd also tip Minimaxir. He's the one that's done most of the work for these tools. 👍

DeFiDude commented 4 years ago

Great to know. I recently saw a Reddit comment from Minimaxir saying that basically anything above 1MB doesn't seem to overfit, so I'll go with way more steps.

That's a good point, Minimaxir is getting some BAT too!

minimaxir commented 4 years ago

Hi, sorry for the late response to this thread.

That "list index out of range" error is a data input issue, not a model training issue.

8 MB is "a lot of data"; when I say not a lot of data, I mean < 1 MB, which is common if getting tweets from a single year. Either way, the outcome will become obvious as the model will overfit if trained for too many steps without a lot of data.
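To see which side of that rough line a dataset falls on, the file size can be checked before training. This is only a sketch of the rule of thumb above: the function name `is_small_dataset` is hypothetical, and treating 1 MB as 1,000,000 bytes is an assumption.

```python
import os

# Assumption: "1 MB" is taken as 1,000,000 bytes for this rough check.
SMALL_DATASET_BYTES = 1_000_000


def is_small_dataset(path, threshold=SMALL_DATASET_BYTES):
    """Return True if the file at `path` is smaller than the threshold in bytes."""
    return os.path.getsize(path) < threshold
```

A True result suggests keeping the step count conservative and watching the generated samples for signs of overfitting.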

In the future, questions about gpt-2-simple are best put in the gpt-2-simple repo.

No BAT donation is necessary, but thanks for the offer! :)

minimaxir commented 4 years ago

This PR looks good: sorry for the delay!