bigmit2011 opened this issue 1 year ago
@bigmit2011 Try concurrent processing for the large data read. Read more here: https://docs.python.org/3/library/concurrency.html
For concurrent processing, I highly recommend using the multiprocessing module of pathos. It is essentially the same as the Python standard library `multiprocessing` module, except that `pathos.multiprocessing` uses `dill` under the hood instead of `pickle`, in order to avoid many of the limitations that `pickle` imposes.
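To make that pickle limitation concrete, here is a tiny stdlib-only demonstration (mine, not from the thread): `pickle` serializes functions by reference (module and qualified name), so a lambda or other locally defined callable cannot be sent to a worker process, whereas `dill` (and therefore pathos) serializes such objects by value.

```python
import pickle

# The standard multiprocessing module pickles the callable it sends to each
# worker process, so anything pickle cannot serialize fails with it.
square = lambda x: x * x

try:
    pickle.dumps(square)
    picklable = True
except (pickle.PicklingError, AttributeError, TypeError):
    # pickle stores functions by reference; a lambda has no importable
    # name, so serialization fails. dill (used by pathos) stores the
    # function body itself and handles this case.
    picklable = False
```

This is exactly the kind of callable that works with `pathos.multiprocessing` but not with the standard library's process pool.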
That said, there are potentially several sources of inefficiency here:

1. Repeated reading of data files. I strongly suspect, looking at the code, that it may be opening and reading the same files over and over, incurring both the cost of reading from disk and of allocating memory to do so. The code appears to take only a portion of the data from each file that it opens, and if a given ticker is in `sample_df` more than once, then for sure the code is opening and reading the same file multiple times while using only a portion of it on each iteration.
2. In addition to the `sample_df` file, if you can provide some of the other data files, it may be possible to identify other inefficiencies in the code.
3. The calls `mc = fplt.make_marketcolors(...)` and `s = fplt.make_mpf_style(...)` can and should be done outside of the loop. Although these should be relatively quick calls, there is no reason to make them thousands of times, resulting in thousands of memory allocations and the ensuing garbage collection.
4. Consider letting `mplfinance` calculate your SMAs and EMAs instead of calculating them externally and then using addplot to plot them. This will certainly make the code simpler; however, one would have to experiment to see whether letting mplfinance do the SMA and EMA calculations is faster, slower, or the same. Based on my understanding of the code, it would be truly difficult to predict, so one would have to actually run the test to find out.
5. Consider reusing a single Figure with `mplfinance`, rather than closing the Figure and reallocating it again and again each time through the loop (thus again avoiding a tremendous amount of object and memory allocation, and garbage collection). To do this, one would save the Figure and Axes from the very first call, and then use mplfinance in external-axes mode after that, minimally clearing the Axes each time through the loop.

Another point, implied above, regarding items 1 and 4: if indeed a ticker may appear in `sample_df` multiple times, then the logical restructuring of the code would be a loop within a loop, where the outer loop reads the data file (`filename = os.path.join(DATA_DIR, ticker + '.csv')`) and calculates all of the SMAs and EMAs for that data set, and the inner loop runs over the various date ranges (`startDate = pd.to_datetime(input_date).date() - relativedelta(months=month)` and `endDate = pd.to_datetime(input_date).date() + relativedelta(days=1)`) for that ticker. In this way each ticker's file is read only once, and the SMAs and EMAs for each ticker are calculated only once.
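The loop-within-a-loop restructuring described above can be sketched as follows. This is my illustration, not code from the thread: the file layout, the `Close` column name, and the 20-period indicator windows are assumptions, and `plot_fn` is a hypothetical stand-in for the actual mplfinance call (e.g. `lambda w, name: fplt.plot(w, type="candle", savefig=name)`).

```python
import os

import pandas as pd


def load_and_prepare(data_dir, ticker):
    """Outer-loop work: read the ticker's file once and compute every
    indicator once, instead of redoing this per date range."""
    filename = os.path.join(data_dir, ticker + ".csv")
    df = pd.read_csv(filename, index_col=0, parse_dates=True)
    df["sma20"] = df["Close"].rolling(20).mean()
    df["ema20"] = df["Close"].ewm(span=20, adjust=False).mean()
    return df


def plot_date_ranges(df, ticker, date_ranges, plot_fn):
    """Inner-loop work: slice the precomputed frame for each
    (start, end) range and plot only the slice."""
    for start, end in date_ranges:
        window = df.loc[str(start):str(end)]
        plot_fn(window, f"{ticker}_{end}.png")
```

With this shape, each ticker's CSV is read once and its indicators are computed once, no matter how many date ranges reference that ticker.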
Hi,
Thank you so much for the detailed reply.
Regarding 1 and 2: So `sample_df` is only opened once. It's usually a dataframe with a list of tickers and dates. Most of the time the tickers are different, so I don't think that's the bottleneck. However, I will look into reusing the data if the ticker is the same. I will try to copy and paste some of the `sample_df` when I get back to my home PC.
The historical data files depend on the ticker. But it's basically the entire historical data of the ticker:
https://finance.yahoo.com/quote/AAPL/history/
3) This makes sense. I will give this a go.
4) Yeah the calculations are in there due to simplicity.
5) This one is a little confusing for me, but I will try to see if I can find some examples.
6) So I'm still a little far from what I want to accomplish, but if there are 12k tickers, I want to be able to run pattern recognition on them daily. I haven't tested the latest computer vision models to see how accurate I can get them for pattern recognition with something like transfer learning (if that's still the best way), but that's the goal.
@bigmit2011 To achieve this, try storing the historical data in a local database or cached file format, and keep updating only the last candle's open, high, low, and close values. This significantly reduces the request time for Yahoo data. For better performance, I would also suggest using an API.
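A minimal sketch of that caching idea, assuming a per-symbol CSV on disk (my illustration, not code from the thread): fetch only the rows at or after the last cached date and append them, overwriting the still-open candle. `fetch_since` is a hypothetical stand-in for whatever data source you use.

```python
import os

import pandas as pd


def update_cache(sym, fetch_since, cache_dir="cache"):
    """Keep a per-symbol CSV cache up to date.

    `fetch_since(sym, last_date)` is a hypothetical function that returns a
    DataFrame of OHLC rows at or after `last_date` (None = full history).
    """
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, f"{sym}.csv")

    if os.path.exists(path):
        cached = pd.read_csv(path, index_col=0, parse_dates=True)
        # only request candles from the last cached date onward
        new_rows = fetch_since(sym, cached.index.max())
        updated = pd.concat([cached, new_rows])
        # keep the freshest copy of any re-fetched (still open) candle
        updated = updated[~updated.index.duplicated(keep="last")]
    else:
        updated = fetch_since(sym, None)

    updated.to_csv(path)
    return updated
```

Subsequent runs then transfer only a handful of rows per symbol instead of the full history.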
Hi,
I actually have data saved locally as csv files and am not scraping during the time of creating charts.
I recently wrote some code for saving mplfinance chart images to disk using concurrent processing.
https://github.com/BennyThadikaran/stock-pattern/blob/main/src/init.py#L107
Below is the main outline of the code using the concurrent.futures module. It assumes your data is already on disk. If you are downloading the data over the network, see the second part.
```python
import concurrent.futures

import matplotlib.pyplot as plt
import mplfinance as mpf
import pandas as pd


def process(sym):
    """This runs in a child process"""
    # load the symbol's file into a DataFrame, do some processing
    # (mplfinance expects a DatetimeIndex, hence index_col/parse_dates)
    df = pd.read_csv(f"{sym}.csv", index_col=0, parse_dates=True)

    # switch to a non-interactive backend when working inside a child process
    plt.switch_backend("AGG")
    plt.ioff()

    mpf.plot(df, type="candle", style="tradingview", savefig=f"{sym}.png")

    # return something useful
    return f"success {sym}"


def main():
    """Main entry point of the script"""
    futures = []
    sym_list = ["tcs", "infosys"]  # your fairly long list of symbols

    with concurrent.futures.ProcessPoolExecutor() as executor:
        for sym in sym_list:
            # Pass the process function and any additional positional
            # and keyword arguments to executor.submit
            future = executor.submit(process, sym)
            futures.append(future)

        for future in concurrent.futures.as_completed(futures):
            # do something with the result
            print(future.result())


if __name__ == "__main__":
    # run the script
    main()
```
If you're making network requests for stock data, you can get a big performance boost using asyncio (stdlib) and aiohttp (an external package).
The benefit is not having to wait for each stock's data to be downloaded. With asyncio.as_completed, you can begin processing responses and saving the images in a child process as soon as each download finishes. Below is a very simplified script demonstrating the crucial bits.
Make sure to use a throttler, or you will exceed the server's API limits.
```python
import asyncio
import concurrent.futures

import aiohttp


async def main():
    sym_list = []  # your symbol list

    async with aiohttp.ClientSession() as session:
        tasks = []

        for sym in sym_list:
            # call your data fetch coroutine with create_task;
            # data_fetch takes the sym and session arguments and calls session.get(url)
            task = asyncio.create_task(data_fetch(sym, session))
            tasks.append(task)

        loop = asyncio.get_running_loop()
        futures_list = []

        with concurrent.futures.ProcessPoolExecutor() as executor:
            # as_completed yields each download as soon as it finishes
            for coro in asyncio.as_completed(tasks):
                stock_data = await coro
                # hand the CPU-bound chart generation off to a child process
                future = loop.run_in_executor(executor, process, stock_data)
                futures_list.append(future)

            results = await asyncio.gather(*futures_list)


if __name__ == "__main__":
    # run the script
    asyncio.run(main())
```
I am using a friend's script (so I don't know all the details in this script), but I wonder if there are simple tricks I can do here to make it save more quickly. I want to be able to save around 10k images.
I plan to incorporate multiprocessing to make it even quicker.
Thank you.